Re: [Qemu-block] [Qemu-devel] [RFC PATCH 0/5] atapi: change unlimited recursion to while loop

2018-02-27 Thread John Snow


On 02/23/2018 10:26 AM, Paolo Bonzini wrote:
> Real hardware doesn't have an unlimited stack, so the unlimited
> recursion in the ATAPI code smells a bit.  In fact, the call to
> ide_transfer_start easily becomes a tail call with a small change
> to the code (patch 4).  The remaining four patches move code around
> so as to the turn the call back to ide_atapi_cmd_reply_end into
> another tail call, and then convert the (double) tail recursion into
> a while loop.
> 
> I'm not sure how this can be tested, apart from adding a READ CD
> test to ahci-test (which I don't really have time for now, hence
> the RFC tag).  The existing AHCI tests still pass, so patches 1-3
> aren't complete crap.
> 
> Paolo
> 
> Paolo Bonzini (5):
>   ide: push call to end_transfer_func out of start_transfer callback
>   ide: push end_transfer callback to ide_transfer_halt
>   ide: make ide_transfer_stop idempotent
>   atapi: call ide_set_irq before ide_transfer_start
>   ide: introduce ide_transfer_start_norecurse
> 
>  hw/ide/ahci.c | 12 +++-
>  hw/ide/atapi.c| 37 -
>  hw/ide/core.c | 37 +++--
>  include/hw/ide/internal.h |  3 +++
>  4 files changed, 53 insertions(+), 36 deletions(-)
> 
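The tail-call-to-loop conversion Paolo describes can be sketched generically. This is illustrative only, not the actual QEMU ATAPI code; the struct and function names are hypothetical. It models a transfer completing in fixed-size chunks, first via self tail-recursion (unbounded stack for large transfers) and then via the equivalent `while` loop:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the ATAPI pattern: the reply-end routine
 * tail-calls back into itself once per transferred chunk. */
typedef struct {
    size_t remaining;   /* bytes left to transfer */
    size_t chunk;       /* bytes moved per step */
    int calls;          /* transfer steps executed */
} Xfer;

/* Recursive form: stack depth grows with the transfer size. */
static void reply_end_recursive(Xfer *x)
{
    if (x->remaining == 0) {
        return;
    }
    size_t n = x->remaining < x->chunk ? x->remaining : x->chunk;
    x->remaining -= n;
    x->calls++;
    reply_end_recursive(x);   /* tail call */
}

/* Loop form: identical behavior, constant stack usage. */
static void reply_end_loop(Xfer *x)
{
    while (x->remaining > 0) {
        size_t n = x->remaining < x->chunk ? x->remaining : x->chunk;
        x->remaining -= n;
        x->calls++;
    }
}
```

Because the recursive call is the last action taken, no state is live across it, which is exactly what makes the mechanical rewrite into a loop safe.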

ACK receipt, I will get to this soon, sorry!



Re: [Qemu-block] [PATCH] vl: introduce vm_shutdown()

2018-02-27 Thread Fam Zheng
On Tue, 02/27 15:30, Stefan Hajnoczi wrote:
> On Fri, Feb 23, 2018 at 04:20:44PM +0800, Fam Zheng wrote:
> > On Tue, 02/20 13:10, Stefan Hajnoczi wrote:
> > > 1. virtio_scsi_handle_cmd_vq() racing with iothread_stop_all() hits the
> > >virtio_scsi_ctx_check() assertion failure because the BDS AioContext
> > >has been modified by iothread_stop_all().
> > 
> > Does this patch fix the issue completely? IIUC virtio_scsi_handle_cmd can
> > already be entered at the time of main thread calling virtio_scsi_clear_aio(),
> > so this race condition still exists:
> > 
> >   main thread   iothread
> > -
> >   vm_shutdown
> > ...
> >   virtio_bus_stop_ioeventfd
> > virtio_scsi_dataplane_stop
> > aio_poll()
> >   ...
> > 
> > virtio_scsi_data_plane_handle_cmd()
> >   aio_context_acquire(s->ctx)
> >   virtio_scsi_acquire(s).enter
> >   virtio_scsi_clear_aio()
> >   aio_context_release(s->ctx)
> >   virtio_scsi_acquire(s).return
> >   virtio_scsi_handle_cmd_vq()
> > ...
> >   virtqueue_pop()
> > 
> > Is it possible that the above virtqueue_pop() still returns one element
> > that was queued before vm_shutdown() was called?
> 
> No, it can't because virtio_scsi_clear_aio() invokes
> virtio_queue_host_notifier_aio_read(&vq->host_notifier) to process the
> virtqueue.  By the time we get back to iothread's
> virtio_scsi_data_plane_handle_cmd() the virtqueue is already empty.
> 
> Vcpus have been paused so no additional elements can slip into the
> virtqueue.

So there is:

static void virtio_queue_host_notifier_aio_read(EventNotifier *n)
{
    VirtQueue *vq = container_of(n, VirtQueue, host_notifier);

    if (event_notifier_test_and_clear(n)) {
        virtio_queue_notify_aio_vq(vq);
    }
}

The guest kicks after adding an element to the VQ, but we check the ioeventfd
before trying virtqueue_pop(). Is that a problem? If vcpus are paused after
enqueuing but before kicking the VQ, the ioeventfd is not set, so the
virtqueue is not processed here.
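As a toy model of the hazard described above (illustrative only, not QEMU code; all names are hypothetical): a handler that only processes the queue when the notifier test-and-clears to true will miss an element enqueued by a vcpu that was paused before it could kick.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy virtqueue: the guest enqueues, then "kicks" by setting the
 * ioeventfd-like notifier.  Pausing between the two steps leaves an
 * element that a notifier-driven poll never sees. */
typedef struct {
    int queued;        /* elements in the queue */
    bool notifier;     /* ioeventfd state */
} ToyVq;

static void guest_enqueue(ToyVq *vq) { vq->queued++; }
static void guest_kick(ToyVq *vq)    { vq->notifier = true; }

/* Mirrors the test-and-clear pattern in
 * virtio_queue_host_notifier_aio_read(): process only when the
 * notifier fires.  Returns the number of elements processed. */
static int host_poll(ToyVq *vq)
{
    if (vq->notifier) {
        vq->notifier = false;
        int n = vq->queued;
        vq->queued = 0;
        return n;
    }
    return 0;
}
```

In this model the element is stranded until a kick arrives, which is precisely the window Fam is asking about.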

Fam




Re: [Qemu-block] [Qemu-devel] [RFC v4 19/21] blockjobs: Expose manual property

2018-02-27 Thread Eric Blake

On 02/27/2018 03:57 PM, John Snow wrote:



On 02/27/2018 03:16 PM, Eric Blake wrote:

On 02/23/2018 05:51 PM, John Snow wrote:

Expose the "manual" property via QAPI for the backup-related jobs.
As of this commit, this allows the management API to request the
"concluded" and "dismiss" semantics for backup jobs.



+# @manual: True to use an expanded, more explicit job control flow.
+#  Jobs may transition from a running state to a pending state,
+#  where they must be instructed to complete manually via
+#  block-job-finalize.
+#  Jobs belonging to a transaction must either all or all not use this
+#  setting. Once a transaction reaches a pending state, issuing the
+#  finalize command to any one job in the transaction is sufficient
+#  to finalize the entire transaction.


The previous commit message talked about mixed-manual transactions, but
this seems to imply it is not possible.  I'm fine if we don't support
mixed-manual transactions, but wonder if it means any changes to the
series.

Otherwise looks reasonable from the UI point of view.



More seriously, this documentation I wrote doesn't address the totality
of the expanded flow. I omitted dismiss here by accident as well. This
is at best a partial definition of the 'manual' property.

I'd like to use _this_ patch to ask the question: "What should the
proper noun for the QEMU 2.12+ Expanded Block Job Management Flow
Mechanism be?"


"Manual" actually doesn't sound too bad; I could also see "Explicit job 
flow", as in, "within a transaction, all jobs should have the same 
setting for the choice of Explicit Job Flow" (but then the name 'manual' 
would have to be changed to match).  The idea of a central document, 
that gets referred to from multiple spots in the QAPI docs, rather than 
duplicating information throughout the QAPI docs, is reasonable.




I'm not too sure, but "Manual mode" leaves a lot to be desired.

I keep calling it something like "2.12+ Job Management" but that's not
really descriptive.


That, and if someone ever backports the enhanced state machine to a 2.11 
branch, it becomes a misnomer.



I conceptualize the feature as the addition of a
purposefully more "needy" and less automatic completion mechanism, hence
the "manual"

Anyway, I'd like to figure out a good "documentation name" for it so I
can point all instances of the creation property (for drive-backup,
drive-mirror, and everyone else) to a central location that explains the
STM and what exactly the differences between manual=on/off are. I'd then
like to expose this property via query and link the documentation there
to this description, too.


"Explicit" and "Manual" are the two best options coming to me as I type 
this email.




It'd be nice-- under the same arguments that prompted 'dismiss'-- to say
that if a client crashes it can reconnect and discover what kind of
attention certain jobs will need by asking for the manual property back.

--js



--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-block] [Qemu-devel] [RFC v4 19/21] blockjobs: Expose manual property

2018-02-27 Thread John Snow


On 02/27/2018 03:16 PM, Eric Blake wrote:
> On 02/23/2018 05:51 PM, John Snow wrote:
>> Expose the "manual" property via QAPI for the backup-related jobs.
>> As of this commit, this allows the management API to request the
>> "concluded" and "dismiss" semantics for backup jobs.
>>
>> Signed-off-by: John Snow 
>> ---
>>   blockdev.c   | 19 ---
>>   qapi/block-core.json | 32 ++--
>>   2 files changed, 42 insertions(+), 9 deletions(-)
>>
> 
>> +++ b/qapi/block-core.json
>> @@ -1177,6 +1177,16 @@
>>   # @job-id: identifier for the newly-created block job. If
>>   #  omitted, the device name will be used. (Since 2.7)
>>   #
>> +# @manual: True to use an expanded, more explicit job control flow.
>> +#  Jobs may transition from a running state to a pending state,
>> +#  where they must be instructed to complete manually via
>> +#  block-job-finalize.
>> +#  Jobs belonging to a transaction must either all or all not use this
>> +#  setting. Once a transaction reaches a pending state, issuing the
>> +#  finalize command to any one job in the transaction is sufficient
>> +#  to finalize the entire transaction.
> 
> The previous commit message talked about mixed-manual transactions, but
> this seems to imply it is not possible.  I'm fine if we don't support
> mixed-manual transactions, but wonder if it means any changes to the
> series.
> 
> Otherwise looks reasonable from the UI point of view.
> 

More seriously, this documentation I wrote doesn't address the totality
of the expanded flow. I omitted dismiss here by accident as well. This
is at best a partial definition of the 'manual' property.

I'd like to use _this_ patch to ask the question: "What should the
proper noun for the QEMU 2.12+ Expanded Block Job Management Flow
Mechanism be?"

I'm not too sure, but "Manual mode" leaves a lot to be desired.

I keep calling it something like "2.12+ Job Management" but that's not
really descriptive. I conceptualize the feature as the addition of a
purposefully more "needy" and less automatic completion mechanism, hence
the "manual"

Anyway, I'd like to figure out a good "documentation name" for it so I
can point all instances of the creation property (for drive-backup,
drive-mirror, and everyone else) to a central location that explains the
STM and what exactly the differences between manual=on/off are. I'd then
like to expose this property via query and link the documentation there
to this description, too.

It'd be nice-- under the same arguments that prompted 'dismiss'-- to say
that if a client crashes it can reconnect and discover what kind of
attention certain jobs will need by asking for the manual property back.

--js



Re: [Qemu-block] [Qemu-devel] [RFC v4 00/21] blockjobs: add explicit job management

2018-02-27 Thread John Snow


On 02/24/2018 09:31 AM, no-re...@patchew.org wrote:
> Hi,
> 
> This series seems to have some coding style problems. See output below for
> more information:
> 
> Type: series
> Message-id: 20180223235142.21501-1-js...@redhat.com
> Subject: [Qemu-devel] [RFC v4 00/21] blockjobs: add explicit job management
> 
> === TEST SCRIPT BEGIN ===
> #!/bin/bash
> 
> BASE=base
> n=1
> total=$(git log --oneline $BASE.. | wc -l)
> failed=0
> 
> git config --local diff.renamelimit 0
> git config --local diff.renames True
> git config --local diff.algorithm histogram
> 
> commits="$(git log --format=%H --reverse $BASE..)"
> for c in $commits; do
> echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..."
> if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
> failed=1
> echo
> fi
> n=$((n+1))
> done
> 
> exit $failed
> === TEST SCRIPT END ===
> 
> Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
> Switched to a new branch 'test'
> 230e578fa2 blockjobs: add manual_mgmt option to transactions
> f278a5155a iotests: test manual job dismissal
> 8e473ab4a8 blockjobs: Expose manual property
> 7ad2d0164f blockjobs: add block-job-finalize
> 3857c91315 blockjobs: add PENDING status and event
> 18eb8a4130 blockjobs: add waiting status
> daf9613432 blockjobs: add prepare callback
> 78be501212 blockjobs: add block_job_txn_apply function
> 4b659abe69 blockjobs: add commit, abort, clean helpers
> 4023046d76 blockjobs: ensure abort is called for cancelled jobs
> e9300b122e blockjobs: add block_job_dismiss
> 4fc045eae4 blockjobs: add NULL state
> e6aa454753 blockjobs: add CONCLUDED state
> 78efa2f937 blockjobs: add ABORTING state
> 057ad2472f blockjobs: add block_job_verb permission table
> c62c5b75a3 iotests: add pause_wait
> 4aadb9c38c blockjobs: add state transition table
> afc594c4b0 blockjobs: add status enum
> 434d3811fa blockjobs: add manual property
> fc3e3eebc9 blockjobs: model single jobs as transactions
> 8d32662676 blockjobs: fix set-speed kick
> 
> === OUTPUT BEGIN ===
> Checking PATCH 1/21: blockjobs: fix set-speed kick...
> Checking PATCH 2/21: blockjobs: model single jobs as transactions...
> Checking PATCH 3/21: blockjobs: add manual property...
> Checking PATCH 4/21: blockjobs: add status enum...
> Checking PATCH 5/21: blockjobs: add state transition table...
> ERROR: space prohibited before open square bracket '['
> #81: FILE: blockjob.c:48:
> +/* U: */ [BLOCK_JOB_STATUS_UNDEFINED] = {0, 1, 0, 0, 0, 0},
> 
> ERROR: space prohibited before open square bracket '['
> #82: FILE: blockjob.c:49:
> +/* C: */ [BLOCK_JOB_STATUS_CREATED]   = {0, 0, 1, 0, 0, 0},
> 
> ERROR: space prohibited before open square bracket '['
> #83: FILE: blockjob.c:50:
> +/* R: */ [BLOCK_JOB_STATUS_RUNNING]   = {0, 0, 0, 1, 1, 0},
> 
> ERROR: space prohibited before open square bracket '['
> #84: FILE: blockjob.c:51:
> +/* P: */ [BLOCK_JOB_STATUS_PAUSED]= {0, 0, 1, 0, 0, 0},
> 
> ERROR: space prohibited before open square bracket '['
> #85: FILE: blockjob.c:52:
> +/* Y: */ [BLOCK_JOB_STATUS_READY] = {0, 0, 0, 0, 0, 1},
> 
> ERROR: space prohibited before open square bracket '['
> #86: FILE: blockjob.c:53:
> +/* S: */ [BLOCK_JOB_STATUS_STANDBY]   = {0, 0, 0, 0, 1, 0},
> 
> total: 6 errors, 0 warnings, 90 lines checked
> 
> Your patch has style problems, please review.  If any of these errors
> are false positives report them to the maintainer, see
> CHECKPATCH in MAINTAINERS.
> 

-EWONTFIX unless someone screams louder than me.
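For reference, the construct checkpatch is flagging is an ordinary designated-initializer transition table, with a space before `[` purely for column alignment. A minimal standalone sketch (hypothetical states, not the actual QEMU table):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative job states; the real table has more rows/columns. */
enum { ST_UNDEFINED, ST_CREATED, ST_RUNNING, ST_PAUSED, ST__MAX };

/* allowed[from][to]: is the from -> to transition legal?
 * The aligned comments name each row, mirroring the flagged style. */
static const bool allowed[ST__MAX][ST__MAX] = {
    /* U: */ [ST_UNDEFINED] = {0, 1, 0, 0},
    /* C: */ [ST_CREATED]   = {0, 0, 1, 0},
    /* R: */ [ST_RUNNING]   = {0, 0, 0, 1},
    /* P: */ [ST_PAUSED]    = {0, 0, 1, 0},
};

static bool transition_ok(int from, int to)
{
    return allowed[from][to];
}
```

The alignment makes the matrix readable as a grid, which is why the checkpatch warning is being declined here.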

> Checking PATCH 6/21: iotests: add pause_wait...
> Checking PATCH 7/21: blockjobs: add block_job_verb permission table...
> Checking PATCH 8/21: blockjobs: add ABORTING state...
> ERROR: space prohibited before open square bracket '['
> #61: FILE: blockjob.c:48:
> +/* U: */ [BLOCK_JOB_STATUS_UNDEFINED] = {0, 1, 0, 0, 0, 0, 0},
> 
> ERROR: space prohibited before open square bracket '['
> #62: FILE: blockjob.c:49:
> +/* C: */ [BLOCK_JOB_STATUS_CREATED]   = {0, 0, 1, 0, 0, 0, 0},
> 
> ERROR: space prohibited before open square bracket '['
> #63: FILE: blockjob.c:50:
> +/* R: */ [BLOCK_JOB_STATUS_RUNNING]   = {0, 0, 0, 1, 1, 0, 1},
> 
> ERROR: space prohibited before open square bracket '['
> #64: FILE: blockjob.c:51:
> +/* P: */ [BLOCK_JOB_STATUS_PAUSED]= {0, 0, 1, 0, 0, 0, 0},
> 
> ERROR: space prohibited before open square bracket '['
> #65: FILE: blockjob.c:52:
> +/* Y: */ [BLOCK_JOB_STATUS_READY] = {0, 0, 0, 0, 0, 1, 1},
> 
> ERROR: space prohibited before open square bracket '['
> #66: FILE: blockjob.c:53:
> +/* S: */ [BLOCK_JOB_STATUS_STANDBY]   = {0, 0, 0, 0, 1, 0, 0},
> 
> ERROR: space prohibited before open square bracket '['
> #67: FILE: blockjob.c:54:
> +/* X: */ [BLOCK_JOB_STATUS_ABORTING]  = {0, 0, 0, 0, 0, 0, 0},
> 
> total: 7 errors, 0 warnings, 62 lines checked

-EWONTFIX

> 
> Your patch has style problems, please review.  If any of these errors
> are false positives report 

Re: [Qemu-block] [Qemu-devel] [RFC v4 00/21] blockjobs: add explicit job management

2018-02-27 Thread John Snow


On 02/25/2018 06:25 PM, no-re...@patchew.org wrote:
>   CC  scsi/qemu-pr-helper.o
> /var/tmp/patchew-tester-tmp-9ukz0kme/src/blockjob.c: In function ‘block_job_txn_apply.isra.8’:
> /var/tmp/patchew-tester-tmp-9ukz0kme/src/blockjob.c:511:5: error: ‘rc’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
>  return rc;
>  ^
>   CC  qemu-bridge-helper.o
>   CC  blockdev.o
>   CC  blockdev-nbd.o
>   CC  bootdevice.o
>   CC  iothread.o
>   CC  qdev-monitor.o
>   CC  device-hotplug.o
>   CC  os-posix.o
>   CC  bt-host.o
>   CC  bt-vhci.o
>   CC  dma-helpers.o
>   CC  vl.o
>   CC  tpm.o
>   CC  device_tree.o
>   CC  qmp-marshal.o
>   CC  qmp.o
>   CC  hmp.o
>   CC  cpus-common.o
>   CC  audio/audio.o
> cc1: all warnings being treated as errors
> make: *** [blockjob.o] Error 1
> make: *** Waiting for unfinished jobs
> === OUTPUT END ===
> 
> Test command exited with code: 2
> 
> 
> ---
> Email generated automatically by Patchew [http://patchew.org/].
> Please send your feedback to patchew-de...@freelists.org

No idea why this would only trigger on ppc/be, but I'll look into it.



Re: [Qemu-block] [Qemu-devel] [RFC v4 17/21] blockjobs: add PENDING status and event

2018-02-27 Thread John Snow


On 02/27/2018 03:05 PM, Eric Blake wrote:
> On 02/23/2018 05:51 PM, John Snow wrote:
>> For jobs utilizing the new manual workflow, we intend to prohibit
>> them from modifying the block graph until the management layer provides
>> an explicit ACK via block-job-finalize to move the process forward.
>>
>> To distinguish this runstate from "ready" or "waiting," we add a new
>> "pending" event.
>>
>> For now, the transition from PENDING to CONCLUDED/ABORTING is automatic,
>> but a future commit will add the explicit block-job-finalize step.
>>
>> Transitions:
>> Waiting -> Pending:   Normal transition.
>> Pending -> Concluded: Normal transition.
>> Pending -> Aborting:  Late transactional failures and cancellations.
>>
>> Removed Transitions:
>> Waiting -> Concluded: Jobs must go to PENDING first.
>>
>> Verbs:
>> Cancel: Can be applied to a pending job.
>>
> 
>> +##
>> +# @BLOCK_JOB_PENDING:
>> +#
>> +# Emitted when a block job is awaiting explicit authorization to finalize graph
>> +# changes via @block-job-finalize. If this job is part of a transaction, it will
>> +# not emit this event until the transaction has converged first.
> 
> Same question of whether this new event is always emitted (and older
> clients presumably ignore it), or only emitted for clients that
> requested new-style state management.
> 

Old style jobs will skip the broadcast of the event, but will still
transition to the state. However, since transition is synchronous, you
likely won't see this state show up in a query for old style jobs.

That was the intent, anyway.

I wanted to be nonintrusive, and felt that this event was likely not
useful in any way unless we were using the new state management scheme.
In the old style, this event will be fully synchronous with COMPLETED or
CANCELLED, for instance.

--js



Re: [Qemu-block] [Qemu-devel] [RFC v4 16/21] blockjobs: add waiting status

2018-02-27 Thread John Snow


On 02/27/2018 03:00 PM, Eric Blake wrote:
> On 02/23/2018 05:51 PM, John Snow wrote:
>> For jobs that are stuck waiting on others in a transaction, it would
>> be nice to know that they are no longer "running" in that sense, but
>> instead are waiting on other jobs in the transaction.
>>
>> Jobs that are "waiting" in this sense cannot be meaningfully altered
>> any longer as they have left their running loop. The only meaningful
>> user verb for jobs in this state is "cancel," which will cancel the
>> whole transaction, too.
>>
>> Transitions:
>> Running -> Waiting:   Normal transition.
>> Ready   -> Waiting:   Normal transition.
>> Waiting -> Aborting:  Transactional cancellation.
>> Waiting -> Concluded: Normal transition.
>>
>> Removed Transitions:
>> Running -> Concluded: Jobs must go to WAITING first.
>> Ready   -> Concluded: Jobs must go to WAITING fisrt.
> 
> s/fisrt/first/
> 
>> +++ b/blockjob.c
> 
>> @@ -3934,6 +3938,29 @@
>>   'offset': 'int',
>>   'speed' : 'int' } }
>>   +##
>> +# @BLOCK_JOB_WAITING:
>> +#
>> +# Emitted when a block job that is part of a transction has stopped work and is
> 
> s/transction/transaction/
> 
>> +# waiting for other jobs in the transaction to reach the same state.
> 
> Is this event emitted only for 'new-style' transactions (old drivers
> will never see it, because they don't request new style), or always (old
> drivers will see, but presumably ignore, it)?
> 

...! Actually, I meant to remove the WAITING *event* entirely, this is a
mistake.

It's purely an informational state that clients likely cannot make any
real use of, because regardless of old or new style, jobs will
transition automatically to "PENDING."

That said, old or new, the state is visible from query-block-jobs.

--js



Re: [Qemu-block] [Qemu-devel] [RFC v4 15/21] blockjobs: add prepare callback

2018-02-27 Thread John Snow


On 02/27/2018 02:56 PM, Eric Blake wrote:
> On 02/23/2018 05:51 PM, John Snow wrote:
>> Some jobs upon finalization may need to perform some work that can
>> still fail. If these jobs are part of a transaction, it's important
>> that these callbacks fail the entire transaction.
>>
>> We allow for a new callback in addition to commit/abort/clean that
>> allows us the opportunity to have fairly late-breaking failures
>> in the transactional process.
>>
>> The expected flow is:
>>
>> - All jobs in a transaction converge to the WAITING state
>>    (added in a forthcoming commit)
>> - All jobs prepare to call either commit/abort
>> - If any job fails, is canceled, or fails preparation, all jobs
>>    call their .abort callback.
>> - All jobs enter the PENDING state, awaiting manual intervention
>>    (also added in a forthcoming commit)
>> - block-job-finalize is issued by the user/management layer
>> - All jobs call their commit callbacks.
>>
>> Signed-off-by: John Snow 
>> ---
>>   blockjob.c   | 34 +++---
>>   include/block/blockjob_int.h | 10 ++
>>   2 files changed, 41 insertions(+), 3 deletions(-)
>>
> 
>> @@ -467,17 +480,22 @@ static void block_job_cancel_async(BlockJob *job)
>>      job->cancelled = true;
>>  }
>>  
>> -static void block_job_txn_apply(BlockJobTxn *txn, void fn(BlockJob *))
>> +static int block_job_txn_apply(BlockJobTxn *txn, int fn(BlockJob *))
>>  {
>>      AioContext *ctx;
>>      BlockJob *job, *next;
>> +    int rc;
>>  
>>      QLIST_FOREACH_SAFE(job, &txn->jobs, txn_list, next) {
>>          ctx = blk_get_aio_context(job->blk);
>>          aio_context_acquire(ctx);
>> -        fn(job);
>> +        rc = fn(job);
>>          aio_context_release(ctx);
>> +        if (rc) {
>> +            break;
>> +        }
> 
> This short-circuits the application of the function to the rest of the
> group.  Is that ever going to be a problem?
> 

With what I've written, I don't think so -- but I can't guarantee
someone won't misunderstand the semantics of it and it will become a
problem. It is a potentially dangerous function in that way.

--js
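The short-circuit semantics under discussion can be sketched in isolation (toy code, not the QEMU function; the callbacks and job representation are hypothetical). The apply loop stops at the first callback returning nonzero, so later jobs never see the function:

```c
#include <assert.h>

/* Toy block_job_txn_apply(): apply fn to each job in order, stopping
 * at the first nonzero return code. */
static int txn_apply(int (*fn)(int *), int *jobs, int n)
{
    for (int i = 0; i < n; i++) {
        int rc = fn(&jobs[i]);
        if (rc) {
            return rc;     /* jobs[i+1..] are left untouched */
        }
    }
    return 0;
}

/* Callback that always succeeds, marking the job. */
static int mark_ok(int *job)
{
    *job = 1;
    return 0;
}

/* Callback that fails on a job in state 2, succeeding otherwise. */
static int fail_on_two(int *job)
{
    if (*job == 2) {
        return -1;
    }
    *job = 1;
    return 0;
}
```

This makes the caveat concrete: after a failure, the trailing jobs are in whatever state they were in before the apply began, and the caller must account for that partial application.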



Re: [Qemu-block] [Qemu-devel] [RFC v4 12/21] blockjobs: ensure abort is called for cancelled jobs

2018-02-27 Thread John Snow


On 02/27/2018 02:49 PM, Eric Blake wrote:
> On 02/23/2018 05:51 PM, John Snow wrote:
>> Presently, even if a job is canceled post-completion as a result of
>> a failing peer in a transaction, it will still call .commit because
>> nothing has updated or changed its return code.
>>
>> The reason why this does not cause problems currently is because
>> backup's implementation of .commit checks for cancellation itself.
>>
>> I'd like to simplify this contract:
>>
>> (1) Abort is called if the job/transaction fails
>> (2) Commit is called if the job/transaction succeeds
>>
>> To this end: A job's return code, if 0, will be forcibly set as
>> -ECANCELED if that job has already concluded. Remove the now
>> redundant check in the backup job implementation.
>>
>> We need to check for cancellation in both block_job_completed
>> AND block_job_completed_single, because jobs may be cancelled between
>> those two calls; for instance in transactions.
>>
>> The check in block_job_completed could be removed, but there's no
>> point in starting to attempt to succeed a transaction that we know
>> in advance will fail.
>>
>> This does NOT affect mirror jobs that are "canceled" during their
>> synchronous phase. The mirror job itself forcibly sets the canceled
>> property to false prior to ceding control, so such cases will invoke
>> the "commit" callback.
>>
>> Signed-off-by: John Snow 
>> ---
>>   block/backup.c |  2 +-
>>   block/trace-events |  1 +
>>   blockjob.c | 19 +++
>>   3 files changed, 17 insertions(+), 5 deletions(-)
> 
> More lines of code, but the contract does seem simpler and useful for
> the later patches.
> 

Unfortunately yes, but it would be less overall if more than Backup made
use of commit/abort, which I anticipate in the future when I try to
start switching jobs over to using the .prepare callback.

It was just genuinely shocking to me that we'd call commit(), but backup
secretly knew we wanted abort. That type of logic belongs in the core
dispatch layer.

> Reviewed-by: Eric Blake 
> 

Thanks for the reviews!



Re: [Qemu-block] [Qemu-devel] [RFC v4 20/21] iotests: test manual job dismissal

2018-02-27 Thread John Snow


On 02/27/2018 03:21 PM, Eric Blake wrote:
> On 02/23/2018 05:51 PM, John Snow wrote:
>> Signed-off-by: John Snow 
>> ---
>>   tests/qemu-iotests/056 | 195
>> +
>>   tests/qemu-iotests/056.out |   4 +-
>>   2 files changed, 197 insertions(+), 2 deletions(-)
>>
> 
> I'm not sure if this covers everything in the series, but it looks like

It definitely doesn't!

> a reasonable expansion and hits a lot of the highlights.  At any rate,
> it's always better to add tests, and the test passing is a good bet that
> the new code will be harder to regress.
> 

More to follow, but I was afraid of wasting time if this series didn't
look on the whole serviceable.

I'll probably focus efforts on expanding blockjob-txn and blockjob unit
tests.

> Reviewed-by: Eric Blake 
> 



Re: [Qemu-block] [Qemu-devel] [RFC v4 19/21] blockjobs: Expose manual property

2018-02-27 Thread John Snow


On 02/27/2018 03:16 PM, Eric Blake wrote:
> On 02/23/2018 05:51 PM, John Snow wrote:
>> Expose the "manual" property via QAPI for the backup-related jobs.
>> As of this commit, this allows the management API to request the
>> "concluded" and "dismiss" semantics for backup jobs.
>>
>> Signed-off-by: John Snow 
>> ---
>>   blockdev.c   | 19 ---
>>   qapi/block-core.json | 32 ++--
>>   2 files changed, 42 insertions(+), 9 deletions(-)
>>
> 
>> +++ b/qapi/block-core.json
>> @@ -1177,6 +1177,16 @@
>>   # @job-id: identifier for the newly-created block job. If
>>   #  omitted, the device name will be used. (Since 2.7)
>>   #
>> +# @manual: True to use an expanded, more explicit job control flow.
>> +#  Jobs may transition from a running state to a pending state,
>> +#  where they must be instructed to complete manually via
>> +#  block-job-finalize.
>> +#  Jobs belonging to a transaction must either all or all not use this
>> +#  setting. Once a transaction reaches a pending state, issuing the
>> +#  finalize command to any one job in the transaction is sufficient
>> +#  to finalize the entire transaction.
> 
> The previous commit message talked about mixed-manual transactions, but
> this seems to imply it is not possible.  I'm fine if we don't support
> mixed-manual transactions, but wonder if it means any changes to the
> series.
> 
> Otherwise looks reasonable from the UI point of view.
> 

Refactor hell.

The intent (and my belief) is that as of right now you CAN mix them. In
earlier drafts, it was not always the case.



Re: [Qemu-block] [Qemu-devel] [RFC v4 21/21] blockjobs: add manual_mgmt option to transactions

2018-02-27 Thread Eric Blake

On 02/23/2018 05:51 PM, John Snow wrote:

This allows us to easily force the option for all jobs belonging
to a transaction to ensure consistency with how all those jobs
will be handled.

This is purely a convenience.

Signed-off-by: John Snow 
---



+++ b/qapi/transaction.json
@@ -79,7 +79,8 @@
  ##
  { 'struct': 'TransactionProperties',
'data': {
-   '*completion-mode': 'ActionCompletionMode'
+   '*completion-mode': 'ActionCompletionMode',
+   '*manual-mgmt': 'bool'


Missing QAPI documentation (what you have elsewhere in the C code can 
probably be copied here, though).


The UI aspect makes sense (I can declare one manual at the transaction 
level instead of multiple manual declarations per member level within 
the transaction).


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-block] [Qemu-devel] [RFC v4 20/21] iotests: test manual job dismissal

2018-02-27 Thread Eric Blake

On 02/23/2018 05:51 PM, John Snow wrote:

Signed-off-by: John Snow 
---
  tests/qemu-iotests/056 | 195 +
  tests/qemu-iotests/056.out |   4 +-
  2 files changed, 197 insertions(+), 2 deletions(-)



I'm not sure if this covers everything in the series, but it looks like 
a reasonable expansion and hits a lot of the highlights.  At any rate, 
it's always better to add tests, and the test passing is a good bet that 
the new code will be harder to regress.


Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-block] [Qemu-devel] [RFC v4 19/21] blockjobs: Expose manual property

2018-02-27 Thread Eric Blake

On 02/23/2018 05:51 PM, John Snow wrote:

Expose the "manual" property via QAPI for the backup-related jobs.
As of this commit, this allows the management API to request the
"concluded" and "dismiss" semantics for backup jobs.

Signed-off-by: John Snow 
---
  blockdev.c   | 19 ---
  qapi/block-core.json | 32 ++--
  2 files changed, 42 insertions(+), 9 deletions(-)




+++ b/qapi/block-core.json
@@ -1177,6 +1177,16 @@
  # @job-id: identifier for the newly-created block job. If
  #  omitted, the device name will be used. (Since 2.7)
  #
+# @manual: True to use an expanded, more explicit job control flow.
+#  Jobs may transition from a running state to a pending state,
+#  where they must be instructed to complete manually via
+#  block-job-finalize.
+#  Jobs belonging to a transaction must either all or all not use this
+#  setting. Once a transaction reaches a pending state, issuing the
+#  finalize command to any one job in the transaction is sufficient
+#  to finalize the entire transaction.


The previous commit message talked about mixed-manual transactions, but 
this seems to imply it is not possible.  I'm fine if we don't support 
mixed-manual transactions, but wonder if it means any changes to the series.


Otherwise looks reasonable from the UI point of view.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-block] [Qemu-devel] [RFC v4 18/21] blockjobs: add block-job-finalize

2018-02-27 Thread Eric Blake

On 02/23/2018 05:51 PM, John Snow wrote:

Instead of automatically transitioning from PENDING to CONCLUDED, gate
the .prepare() and .commit() phases behind an explicit acknowledgement
provided by the QMP monitor if manual completion mode has been requested.

This allows us to perform graph changes in prepare and/or commit so that
graph changes do not occur autonomously without knowledge of the
controlling management layer.

Transactions that have reached the "PENDING" state together can all be
moved to invoke their finalization methods by issuing block_job_finalize
to any one job in the transaction.

Jobs in a transaction with mixed job->manual settings will remain stuck
in the "WAITING" state until block_job_finalize is authored on the job(s)
that have reached the "PENDING" state.

These jobs are not allowed to progress because other jobs in the
transaction may still fail during their preparation phase during
finalization, so these jobs must remain in the WAITING phase until
success is guaranteed. These jobs will then automatically dismiss
themselves, but jobs that had the manual property set will remain
at CONCLUDED as normal.



Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-block] [Qemu-devel] [RFC v4 17/21] blockjobs: add PENDING status and event

2018-02-27 Thread Eric Blake

On 02/23/2018 05:51 PM, John Snow wrote:

For jobs utilizing the new manual workflow, we intend to prohibit
them from modifying the block graph until the management layer provides
an explicit ACK via block-job-finalize to move the process forward.

To distinguish this runstate from "ready" or "waiting," we add a new
"pending" event.

For now, the transition from PENDING to CONCLUDED/ABORTING is automatic,
but a future commit will add the explicit block-job-finalize step.

Transitions:
Waiting -> Pending:   Normal transition.
Pending -> Concluded: Normal transition.
Pending -> Aborting:  Late transactional failures and cancellations.

Removed Transitions:
Waiting -> Concluded: Jobs must go to PENDING first.

Verbs:
Cancel: Can be applied to a pending job.




+##
+# @BLOCK_JOB_PENDING:
+#
+# Emitted when a block job is awaiting explicit authorization to finalize graph
+# changes via @block-job-finalize. If this job is part of a transaction, it will
+# not emit this event until the transaction has converged first.


Same question of whether this new event is always emitted (and older 
clients presumably ignore it), or only emitted for clients that 
requested new-style state management.


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-block] [Qemu-devel] [RFC v4 16/21] blockjobs: add waiting status

2018-02-27 Thread Eric Blake

On 02/23/2018 05:51 PM, John Snow wrote:

For jobs that are stuck waiting on others in a transaction, it would
be nice to know that they are no longer "running" in that sense, but
instead are waiting on other jobs in the transaction.

Jobs that are "waiting" in this sense cannot be meaningfully altered
any longer as they have left their running loop. The only meaningful
user verb for jobs in this state is "cancel," which will cancel the
whole transaction, too.

Transitions:
Running -> Waiting:   Normal transition.
Ready   -> Waiting:   Normal transition.
Waiting -> Aborting:  Transactional cancellation.
Waiting -> Concluded: Normal transition.

Removed Transitions:
Running -> Concluded: Jobs must go to WAITING first.
Ready   -> Concluded: Jobs must go to WAITING fisrt.


s/fisrt/first/


+++ b/blockjob.c



@@ -3934,6 +3938,29 @@
  'offset': 'int',
  'speed' : 'int' } }
  
+##

+# @BLOCK_JOB_WAITING:
+#
+# Emitted when a block job that is part of a transction has stopped work and is


s/transction/transaction/


+# waiting for other jobs in the transaction to reach the same state.


Is this event emitted only for 'new-style' transactions (old drivers 
will never see it, because they don't request new style), or always (old 
drivers will see, but presumably ignore, it)?


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-block] [Qemu-devel] [RFC v4 15/21] blockjobs: add prepare callback

2018-02-27 Thread Eric Blake

On 02/23/2018 05:51 PM, John Snow wrote:

Some jobs upon finalization may need to perform some work that can
still fail. If these jobs are part of a transaction, it's important
that these callbacks fail the entire transaction.

We allow for a new callback in addition to commit/abort/clean that
allows us the opportunity to have fairly late-breaking failures
in the transactional process.

The expected flow is:

- All jobs in a transaction converge to the WAITING state
   (added in a forthcoming commit)
- All jobs prepare to call either commit/abort
- If any job fails, is canceled, or fails preparation, all jobs
   call their .abort callback.
- All jobs enter the PENDING state, awaiting manual intervention
   (also added in a forthcoming commit)
- block-job-finalize is issued by the user/management layer
- All jobs call their commit callbacks.

Signed-off-by: John Snow 
---
  blockjob.c   | 34 +++---
  include/block/blockjob_int.h | 10 ++
  2 files changed, 41 insertions(+), 3 deletions(-)




@@ -467,17 +480,22 @@ static void block_job_cancel_async(BlockJob *job)
  job->cancelled = true;
  }
  
-static void block_job_txn_apply(BlockJobTxn *txn, void fn(BlockJob *))

+static int block_job_txn_apply(BlockJobTxn *txn, int fn(BlockJob *))
  {
  AioContext *ctx;
  BlockJob *job, *next;
+int rc;
  
  QLIST_FOREACH_SAFE(job, &txn->jobs, txn_list, next) {

  ctx = blk_get_aio_context(job->blk);
  aio_context_acquire(ctx);
-fn(job);
+rc = fn(job);
  aio_context_release(ctx);
+if (rc) {
+break;
+}


This short-circuits the application of the function to the rest of the 
group.  Is that ever going to be a problem?
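A toy model of the short-circuit being asked about (illustrative names only, not the QEMU API): the first nonzero return from the callback stops the walk, so later jobs in the transaction are never visited.

```c
#include <assert.h>

/* Minimal model of the short-circuit in the patched block_job_txn_apply:
 * the loop stops at the first job whose callback returns nonzero. */
static int visited[3];

static int fail_second(int idx)
{
    visited[idx] = 1;
    return idx == 1 ? -1 : 0;   /* second job reports failure */
}

static int txn_apply(int n, int (*fn)(int))
{
    int rc = 0;
    for (int i = 0; i < n; i++) {
        rc = fn(i);
        if (rc) {
            break;              /* remaining jobs are skipped */
        }
    }
    return rc;
}
```

Whether skipping the remaining jobs is safe depends on the callback: for a prepare-style callback it is exactly what triggers the transaction-wide abort.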


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-block] [Qemu-devel] [RFC v4 14/21] blockjobs: add block_job_txn_apply function

2018-02-27 Thread Eric Blake

On 02/23/2018 05:51 PM, John Snow wrote:

Simply apply a function transaction-wide.
A few more uses of this in forthcoming patches.

Signed-off-by: John Snow 
---
  blockjob.c | 24 +++-
  1 file changed, 15 insertions(+), 9 deletions(-)




@@ -565,13 +577,7 @@ static void block_job_completed_txn_success(BlockJob *job)
  }
  }
  /* We are the last completed job, commit the transaction. */
-QLIST_FOREACH_SAFE(other_job, &txn->jobs, txn_list, next) {
-ctx = blk_get_aio_context(other_job->blk);
-aio_context_acquire(ctx);
-assert(other_job->ret == 0);


We lost this assertion.  Hopefully that doesn't matter in the long run.


-block_job_completed_single(other_job);
-aio_context_release(ctx);
-}
+block_job_txn_apply(txn, block_job_completed_single);
  }
  
  /* Assumes the block_job_mutex is held */




Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-block] [Qemu-devel] [RFC v4 13/21] blockjobs: add commit, abort, clean helpers

2018-02-27 Thread Eric Blake

On 02/23/2018 05:51 PM, John Snow wrote:

The completed_single function is getting a little mucked up with
checking to see which callbacks exist, so let's factor them out.

Signed-off-by: John Snow 
---
  blockjob.c | 35 ++-
  1 file changed, 26 insertions(+), 9 deletions(-)



Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-block] [Qemu-devel] [RFC v4 12/21] blockjobs: ensure abort is called for cancelled jobs

2018-02-27 Thread Eric Blake

On 02/23/2018 05:51 PM, John Snow wrote:

Presently, even if a job is canceled post-completion as a result of
a failing peer in a transaction, it will still call .commit because
nothing has updated or changed its return code.

The reason why this does not cause problems currently is because
backup's implementation of .commit checks for cancellation itself.

I'd like to simplify this contract:

(1) Abort is called if the job/transaction fails
(2) Commit is called if the job/transaction succeeds

To this end: A job's return code, if 0, will be forcibly set as
-ECANCELED if that job has already concluded. Remove the now
redundant check in the backup job implementation.

We need to check for cancellation in both block_job_completed
AND block_job_completed_single, because jobs may be cancelled between
those two calls; for instance in transactions.

The check in block_job_completed could be removed, but there's no
point in starting to attempt to succeed a transaction that we know
in advance will fail.

This does NOT affect mirror jobs that are "canceled" during their
synchronous phase. The mirror job itself forcibly sets the canceled
property to false prior to ceding control, so such cases will invoke
the "commit" callback.
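A minimal sketch of the simplified contract, assuming a stripped-down job struct (the real update happens inside the block_job_completed paths): a job cancelled after completing with ret == 0 is forced to -ECANCELED, so the abort path rather than commit runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <errno.h>

/* Hypothetical model of the contract: commit on success, abort otherwise,
 * with cancellation forcing a nonzero return code. */
struct job {
    bool cancelled;
    int ret;
};

static void job_update_rc(struct job *job)
{
    if (job->cancelled && job->ret == 0) {
        job->ret = -ECANCELED;  /* cancellation counts as failure */
    }
}

static bool job_should_commit(struct job *job)
{
    job_update_rc(job);
    return job->ret == 0;       /* (1) abort on failure, (2) commit on success */
}
```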

Signed-off-by: John Snow 
---
  block/backup.c |  2 +-
  block/trace-events |  1 +
  blockjob.c | 19 +++
  3 files changed, 17 insertions(+), 5 deletions(-)


More lines of code, but the contract does seem simpler and useful for 
the later patches.


Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-block] [Qemu-devel] [RFC v4 11/21] blockjobs: add block_job_dismiss

2018-02-27 Thread Eric Blake

On 02/23/2018 05:51 PM, John Snow wrote:

For jobs that have reached their CONCLUDED state, prior to having their
last reference put down (meaning jobs that have completed successfully,
unsuccessfully, or have been canceled), allow the user to dismiss the
job's lingering status report via block-job-dismiss.

This gives management APIs the chance to conclusively determine if a job
failed or succeeded, even if the event broadcast was missed.

Note that jobs do not yet linger in any such state; they are freed
immediately upon reaching this previously-unnamed state. Such a state is
added in the next commit.

Verbs:
Dismiss: operates on CONCLUDED jobs only.
Signed-off-by: John Snow 
---


Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-block] [Qemu-devel] [RFC v4 09/21] blockjobs: add CONCLUDED state

2018-02-27 Thread John Snow


On 02/27/2018 02:38 PM, Eric Blake wrote:
> On 02/23/2018 05:51 PM, John Snow wrote:
>> add a new state "CONCLUDED" that identifies a job that has ceased all
>> operations. The wording was chosen to avoid any phrasing that might
>> imply success, error, or cancellation. The task has simply ceased all
>> operation and can never again perform any work.
>>
>> ("finished", "done", and "completed" might all imply success.)
>>
>> Transitions:
>> Running  -> Concluded: normal completion
>> Ready    -> Concluded: normal completion
>> Aborting -> Concluded: error and cancellations
>>
>> Verbs:
>> None as of this commit. (a future commit adds 'dismiss')
>>
> 
> 
>> @@ -620,7 +623,9 @@ void block_job_user_resume(BlockJob *job, Error
>> **errp)
>>     void block_job_cancel(BlockJob *job)
>>   {
>> -    if (block_job_started(job)) {
>> +    if (job->status == BLOCK_JOB_STATUS_CONCLUDED) {
>> +    return;
>> +    } else if (block_job_started(job)) {
> 
> It's also possible to do:
> 
> if () {
>   return ;
> }
> if (...)
> 
> instead of chaining with an 'else if'.  Matter of personal taste, so I
> won't make you change it.
> 
> Reviewed-by: Eric Blake 
> 

I guess because in this case, what I did adds two SLOC instead of three.
If the other code did not need to be guarded by an if(), I'd otherwise
agree.



Re: [Qemu-block] [Qemu-devel] [RFC v4 10/21] blockjobs: add NULL state

2018-02-27 Thread Eric Blake

On 02/23/2018 05:51 PM, John Snow wrote:

Add a new state that specifically demarcates when we begin to permanently
demolish a job after it has performed all work. This makes the transition
explicit in the STM table and highlights conditions under which a job may
be demolished.



Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-block] [Qemu-devel] [RFC v4 07/21] blockjobs: add block_job_verb permission table

2018-02-27 Thread John Snow


On 02/27/2018 02:25 PM, Eric Blake wrote:
> On 02/23/2018 05:51 PM, John Snow wrote:
>> Which commands ("verbs") are appropriate for jobs in which state is
>> also somewhat burdensome to keep track of.
>>
>> As of this commit, it looks rather useless, but begins to look more
>> interesting the more states we add to the STM table.
>>
>> A recurring theme is that no verb will apply to an 'undefined' job.
> 
> Back to the argument of whether we even want that state.
> 

By design, it is intended to reject all commands.

Though, what I can do is just remove the UNDEFINED state but have the
"CREATED" state start at "1" still. In effect, this gives you the empty
row and column for the 'UNDEFINED' state, except now it's *literally*
undefined.

However, the transition then needs to be set explicitly. The way the
code is written now, the status *only* ever changes from the transition
function, which makes it easy to search for and reason about all
possible transition points which I consider a feature of the design.

>>
>> ===
>> Changes
>> ===
>>
>> (1)
>>
>> To facilitate "nice" error checking, all five major block-job verb
>> interfaces in blockjob.c now support an errp parameter:
>>
>> - block_job_user_cancel is added as a new interface.
>> - block_job_user_pause gains an errp parameter
>> - block_job_user_resume gains an errp parameter
>> - block_job_set_speed already had an errp parameter.
>> - block_job_complete already had an errp parameter.
>>
>> (2)
>>
>> block-job-pause and block-job-resume will no longer no-op when trying
>> to pause an already paused job, or trying to resume a job that isn't
>> paused. These functions will now report that they did not perform the
>> action requested because it was not possible.
>>
>> iotests have been adjusted to address this new behavior.
> 
> Seems reasonable.  Hopefully shouldn't trip up libvirt too badly (if
> libvirt attempted a redundant job transition that used to silently
> succeed and now fails, the failure message should be pretty obvious that
> it was a no-op attempt).
> 

Yeah, it's arguable whether or not this is an API change.

(A) I'm not prohibiting anything that would have worked before. I am
just notifying the client that whatever they were trying to do had no
effect.

(B) That notification is an error, though, so some code paths that may
have relied on jamming pause signals into the pipe may be surprised at
the new response.

If I can get away with it, I obviously prefer to warn the user that the
pause/resume had no effect or couldn't be applied.

>>
>> (3)
>>
>> block-job-complete doesn't worry about checking !block_job_started,
>> because the permission table guards against this.
>>
>> (4)
>>
>> test-bdrv-drain's job implementation needs to announce that it is
>> 'ready' now, in order to be completed.
>>
>> Signed-off-by: John Snow 
>> ---
> 
> Reviewed-by: Eric Blake 
> 



Re: [Qemu-block] [Qemu-devel] [RFC v4 09/21] blockjobs: add CONCLUDED state

2018-02-27 Thread Eric Blake

On 02/23/2018 05:51 PM, John Snow wrote:

add a new state "CONCLUDED" that identifies a job that has ceased all
operations. The wording was chosen to avoid any phrasing that might
imply success, error, or cancellation. The task has simply ceased all
operation and can never again perform any work.

("finished", "done", and "completed" might all imply success.)

Transitions:
Running  -> Concluded: normal completion
Ready-> Concluded: normal completion
Aborting -> Concluded: error and cancellations

Verbs:
None as of this commit. (a future commit adds 'dismiss')





@@ -620,7 +623,9 @@ void block_job_user_resume(BlockJob *job, Error **errp)
  
  void block_job_cancel(BlockJob *job)

  {
-if (block_job_started(job)) {
+if (job->status == BLOCK_JOB_STATUS_CONCLUDED) {
+return;
+} else if (block_job_started(job)) {


It's also possible to do:

if () {
  return ;
}
if (...)

instead of chaining with an 'else if'.  Matter of personal taste, so I 
won't make you change it.


Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-block] [Qemu-devel] [RFC v4 08/21] blockjobs: add ABORTING state

2018-02-27 Thread Eric Blake

On 02/23/2018 05:51 PM, John Snow wrote:

Add a new state ABORTING.

This makes transitions from normative states to error states explicit
in the STM, and serves as a disambiguation for which states may complete
normally when normal end-states (CONCLUDED) are added in future commits.

Notably, Paused/Standby jobs do not transition directly to aborting,
as they must wake up first and cooperate in their cancellation.

Transitions:
Running -> Aborting: can be cancelled or encounter an error
Ready   -> Aborting: can be cancelled or encounter an error

Verbs:
None. The job must finish cleaning itself up and report its final status.

                 +---------+
                 |UNDEFINED|
                 +--+------+
                    |
                 +--v----+
                 |CREATED|
                 +--+----+
                    |
                 +--v----+     +------+
       +---------+RUNNING<----->PAUSED|
       |         +--+----+     +------+
       |            |
       |         +--v--+       +-------+
       +---------+READY<------->STANDBY|
       |         +-----+       +-------+
       |
    +--v-----+
    |ABORTING|
    +--------+

Signed-off-by: John Snow 
---
  blockjob.c   | 31 ++-
  qapi/block-core.json |  7 ++-
  2 files changed, 24 insertions(+), 14 deletions(-)




+++ b/qapi/block-core.json
@@ -996,10 +996,15 @@
  # @standby: The job is ready, but paused. This is nearly identical to @paused.
  #   The job may return to @ready or otherwise be canceled.
  #
+# @aborting: The job is in the process of being aborted, and will finish with
+#an error. The job will afterwards report that it is @concluded.


@concluded isn't defined yet, but I'm okay with the minor future 
reference, as it's less churn to get to the description that works at 
the end of the series.


Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-block] [Qemu-devel] [RFC v4 07/21] blockjobs: add block_job_verb permission table

2018-02-27 Thread Eric Blake

On 02/23/2018 05:51 PM, John Snow wrote:

Which commands ("verbs") are appropriate for jobs in which state is
also somewhat burdensome to keep track of.

As of this commit, it looks rather useless, but begins to look more
interesting the more states we add to the STM table.

A recurring theme is that no verb will apply to an 'undefined' job.


Back to the argument of whether we even want that state.



===
Changes
===

(1)

To facilitate "nice" error checking, all five major block-job verb
interfaces in blockjob.c now support an errp parameter:

- block_job_user_cancel is added as a new interface.
- block_job_user_pause gains an errp parameter
- block_job_user_resume gains an errp parameter
- block_job_set_speed already had an errp parameter.
- block_job_complete already had an errp parameter.

(2)

block-job-pause and block-job-resume will no longer no-op when trying
to pause an already paused job, or trying to resume a job that isn't
paused. These functions will now report that they did not perform the
action requested because it was not possible.

iotests have been adjusted to address this new behavior.


Seems reasonable.  Hopefully shouldn't trip up libvirt too badly (if 
libvirt attempted a redundant job transition that used to silently 
succeed and now fails, the failure message should be pretty obvious that 
it was a no-op attempt).




(3)

block-job-complete doesn't worry about checking !block_job_started,
because the permission table guards against this.

(4)

test-bdrv-drain's job implementation needs to announce that it is
'ready' now, in order to be completed.

Signed-off-by: John Snow 
---


Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-block] [Qemu-devel] [RFC v4 05/21] blockjobs: add state transition table

2018-02-27 Thread John Snow


On 02/27/2018 01:58 PM, Eric Blake wrote:
> On 02/23/2018 05:51 PM, John Snow wrote:
>> The state transition table has mostly been implied. We're about to make
>> it a bit more complex, so let's make the STM explicit instead.
>>
>> Perform state transitions with a function that for now just asserts the
>> transition is appropriate.
>>
>> Transitions:
>> Undefined -> Created: During job initialization.
> 
> Unless we use Created as the default state 0 for g_new0().
> 

I liked the idea of letting jobs be created in an "indeterminate" state
until we actually initialize them to be of use -- that is, jobs that
could be said to semantically understand and act on the "START" verb
(which is, as of today, an internal command only.)

The only meaningful action on a job of indeterminate state, then, would
be to DEFINE that job. (I.e. what block_job_create does.)

What I'm getting at is that block_job_start() on a job that was just
created will explode, and I'd like "created" to mean "This job can be
started."

It's not a distinction that matters in the codebase RIGHT NOW, but
that's how I came to think of the STM. We could likely optimize that
transition out because we always create and immediately define it, but
it felt ... nicer from an (internal) API point of view to have defined
the construction and destruction transitions explicitly.

Anyway, it can be removed; it's not a hill worth dying on. I only insist
that the bike shed not be olive green.

>> Created   -> Running: Once the job is started.
>>    Jobs cannot transition from "Created" to "Paused"
>>    directly, but will instead synchronously
>> transition
>>    to running to paused immediately.
>> Running   -> Paused:  Normal workflow for pauses.
>> Running   -> Ready:   Normal workflow for jobs reaching their sync point.
>>    (e.g. mirror)
>> Ready -> Standby: Normal workflow for pausing ready jobs.
>> Paused    -> Running: Normal resume.
>> Standby   -> Ready:   Resume of a Standby job.
>>
>>
>> +---------+
>> |UNDEFINED|
>> +--+------+
>>    |
>> +--v----+
>> |CREATED|
>> +--+----+
>>    |
>> +--v----+     +------+
>> |RUNNING<----->PAUSED|
>> +--+----+     +------+
>>    |
>> +--v--+       +-------+
>> |READY<------->STANDBY|
>> +-----+       +-------+
>>
>>
>> Notably, there is no state presently defined as of this commit that
>> deals with a job after the "running" or "ready" states, so this table
>> will be adjusted alongside the commits that introduce those states.
> 
> The ascii-art tables help in this and other patches.  Thanks for
> producing them.
> 
>>
>> Signed-off-by: John Snow 
>> ---
>>   block/trace-events |  3 +++
>>   blockjob.c | 42 --
>>   2 files changed, 39 insertions(+), 6 deletions(-)
>>
> 
>> +static void block_job_state_transition(BlockJob *job, BlockJobStatus s1)
>> +{
>> +    BlockJobStatus s0 = job->status;
>> +    if (s0 == s1) {
>> +    return;
>> +    }

If I remove this clause, I actually would have noticed that technically
I attempt to transition from CREATED to CREATED. Maybe I ought to remove
this...

>> +    assert(s1 >= 0 && s1 <= BLOCK_JOB_STATUS__MAX);
> 
> Or, if you keep the zero state distinct from valid states, this could be
> 'assert(s1 > 0 && ...)'
> 
>> @@ -320,7 +349,7 @@ void block_job_start(BlockJob *job)
>>   job->pause_count--;
>>   job->busy = true;
>>   job->paused = false;
>> -    job->status = BLOCK_JOB_STATUS_RUNNING;
>> +    block_job_state_transition(job, BLOCK_JOB_STATUS_RUNNING);
>>   bdrv_coroutine_enter(blk_bs(job->blk), job->co);
>>   }
>>   @@ -704,6 +733,7 @@ void *block_job_create(const char *job_id, const
>> BlockJobDriver *driver,
>>   job->refcnt    = 1;
>>   job->manual    = (flags & BLOCK_JOB_MANUAL);
>>   job->status    = BLOCK_JOB_STATUS_CREATED;
>> +    block_job_state_transition(job, BLOCK_JOB_STATUS_CREATED);
> 
> Oops, missing a deletion of the job->status assignment line.
> 

I am indeed.



Re: [Qemu-block] [Qemu-devel] [RFC v4 05/21] blockjobs: add state transition table

2018-02-27 Thread John Snow


On 02/27/2018 01:58 PM, Eric Blake wrote:
> The ascii-art tables help in this and other patches.  Thanks for
> producing them.

Funny story: the original version of the STM present in these patches
was too complex and I was not able to draw a chart for them.

Kevin's insistence that I do draw a chart led to considerable slimming
of possible transitions.

--js



Re: [Qemu-block] [Qemu-devel] [RFC v4 06/21] iotests: add pause_wait

2018-02-27 Thread Eric Blake

On 02/23/2018 05:51 PM, John Snow wrote:

Split out the pause command into the actual pause and the wait.
Not every usage presently needs to resubmit a pause request.

The intent with the next commit will be to explicitly disallow
redundant or meaningless pause/resume requests, so the tests
need to become more judicious to reflect that.

Signed-off-by: John Snow 
---
  tests/qemu-iotests/030|  6 ++
  tests/qemu-iotests/055| 17 ++---
  tests/qemu-iotests/iotests.py | 12 
  3 files changed, 16 insertions(+), 19 deletions(-)



Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-block] [Qemu-devel] [RFC v4 05/21] blockjobs: add state transition table

2018-02-27 Thread John Snow


On 02/27/2018 12:08 PM, Kevin Wolf wrote:
> Am 27.02.2018 um 17:45 hat John Snow geschrieben:
>>
>>
>> On 02/27/2018 11:27 AM, Kevin Wolf wrote:
>>> Am 24.02.2018 um 00:51 hat John Snow geschrieben:
 The state transition table has mostly been implied. We're about to make
 it a bit more complex, so let's make the STM explicit instead.

 Perform state transitions with a function that for now just asserts the
 transition is appropriate.

 Transitions:
 Undefined -> Created: During job initialization.
 Created   -> Running: Once the job is started.
   Jobs cannot transition from "Created" to "Paused"
   directly, but will instead synchronously transition
   to running to paused immediately.
 Running   -> Paused:  Normal workflow for pauses.
 Running   -> Ready:   Normal workflow for jobs reaching their sync point.
   (e.g. mirror)
 Ready -> Standby: Normal workflow for pausing ready jobs.
 Paused-> Running: Normal resume.
 Standby   -> Ready:   Resume of a Standby job.


 +---------+
 |UNDEFINED|
 +--+------+
    |
 +--v----+
 |CREATED|
 +--+----+
    |
 +--v----+     +------+
 |RUNNING<----->PAUSED|
 +--+----+     +------+
    |
 +--v--+       +-------+
 |READY<------->STANDBY|
 +-----+       +-------+


 Notably, there is no state presently defined as of this commit that
 deals with a job after the "running" or "ready" states, so this table
 will be adjusted alongside the commits that introduce those states.

 Signed-off-by: John Snow 
 ---
  block/trace-events |  3 +++
  blockjob.c | 42 --
  2 files changed, 39 insertions(+), 6 deletions(-)

 diff --git a/block/trace-events b/block/trace-events
 index 02dd80ff0c..b75a0c8409 100644
 --- a/block/trace-events
 +++ b/block/trace-events
 @@ -4,6 +4,9 @@
  bdrv_open_common(void *bs, const char *filename, int flags, const char 
 *format_name) "bs %p filename \"%s\" flags 0x%x format_name \"%s\""
  bdrv_lock_medium(void *bs, bool locked) "bs %p locked %d"
  
 +# blockjob.c
 +block_job_state_transition(void *job,  int ret, const char *legal, const 
 char *s0, const char *s1) "job %p (ret: %d) attempting %s transition 
 (%s-->%s)"
 +
  # block/block-backend.c
  blk_co_preadv(void *blk, void *bs, int64_t offset, unsigned int bytes, 
 int flags) "blk %p bs %p offset %"PRId64" bytes %u flags 0x%x"
  blk_co_pwritev(void *blk, void *bs, int64_t offset, unsigned int bytes, 
 int flags) "blk %p bs %p offset %"PRId64" bytes %u flags 0x%x"
 diff --git a/blockjob.c b/blockjob.c
 index 1be9c20cff..d745b3bb69 100644
 --- a/blockjob.c
 +++ b/blockjob.c
 @@ -28,6 +28,7 @@
  #include "block/block.h"
  #include "block/blockjob_int.h"
  #include "block/block_int.h"
 +#include "block/trace.h"
  #include "sysemu/block-backend.h"
  #include "qapi/error.h"
  #include "qapi/qmp/qerror.h"
 @@ -41,6 +42,34 @@
   * block_job_enter. */
  static QemuMutex block_job_mutex;
  
 +/* BlockJob State Transition Table */
 +bool BlockJobSTT[BLOCK_JOB_STATUS__MAX][BLOCK_JOB_STATUS__MAX] = {
 +  /* U, C, R, P, Y, S */
 +/* U: */ [BLOCK_JOB_STATUS_UNDEFINED] = {0, 1, 0, 0, 0, 0},
>>>
>>> Even at the end of the series, this is the only use of
>>> BLOCK_JOB_STATUS_UNDEFINED.
>>>
 +/* C: */ [BLOCK_JOB_STATUS_CREATED]   = {0, 0, 1, 0, 0, 0},
 +/* R: */ [BLOCK_JOB_STATUS_RUNNING]   = {0, 0, 0, 1, 1, 0},
 +/* P: */ [BLOCK_JOB_STATUS_PAUSED]= {0, 0, 1, 0, 0, 0},
 +/* Y: */ [BLOCK_JOB_STATUS_READY] = {0, 0, 0, 0, 0, 1},
 +/* S: */ [BLOCK_JOB_STATUS_STANDBY]   = {0, 0, 0, 0, 1, 0},
 +};
 +
 +static void block_job_state_transition(BlockJob *job, BlockJobStatus s1)
 +{
 +BlockJobStatus s0 = job->status;
 +if (s0 == s1) {
 +return;
 +}
 +assert(s1 >= 0 && s1 <= BLOCK_JOB_STATUS__MAX);
 +trace_block_job_state_transition(job, job->ret, BlockJobSTT[s0][s1] ?
 + "allowed" : "disallowed",
 + qapi_enum_lookup(&BlockJobStatus_lookup,
 +  s0),
 + qapi_enum_lookup(&BlockJobStatus_lookup,
 +  s1));
 +assert(BlockJobSTT[s0][s1]);
 +job->status = s1;
 +}
 +
  static void block_job_lock(void)
  {
  qemu_mutex_lock(&block_job_mutex);
 @@ -320,7 +349,7 @@ void block_job_start(BlockJob *job)
  job->pause_count--;
  

Re: [Qemu-block] [Qemu-devel] [RFC v4 05/21] blockjobs: add state transition table

2018-02-27 Thread Eric Blake

On 02/23/2018 05:51 PM, John Snow wrote:

The state transition table has mostly been implied. We're about to make
it a bit more complex, so let's make the STM explicit instead.

Perform state transitions with a function that for now just asserts the
transition is appropriate.

Transitions:
Undefined -> Created: During job initialization.


Unless we use Created as the default state 0 for g_new0().


Created   -> Running: Once the job is started.
   Jobs cannot transition from "Created" to "Paused"
   directly, but will instead synchronously transition
   to running to paused immediately.
Running   -> Paused:  Normal workflow for pauses.
Running   -> Ready:   Normal workflow for jobs reaching their sync point.
   (e.g. mirror)
Ready -> Standby: Normal workflow for pausing ready jobs.
Paused-> Running: Normal resume.
Standby   -> Ready:   Resume of a Standby job.


+---------+
|UNDEFINED|
+--+------+
   |
+--v----+
|CREATED|
+--+----+
   |
+--v----+     +------+
|RUNNING<----->PAUSED|
+--+----+     +------+
   |
+--v--+       +-------+
|READY<------->STANDBY|
+-----+       +-------+


Notably, there is no state presently defined as of this commit that
deals with a job after the "running" or "ready" states, so this table
will be adjusted alongside the commits that introduce those states.


The ascii-art tables help in this and other patches.  Thanks for 
producing them.




Signed-off-by: John Snow 
---
  block/trace-events |  3 +++
  blockjob.c | 42 --
  2 files changed, 39 insertions(+), 6 deletions(-)




+static void block_job_state_transition(BlockJob *job, BlockJobStatus s1)
+{
+BlockJobStatus s0 = job->status;
+if (s0 == s1) {
+return;
+}
+assert(s1 >= 0 && s1 <= BLOCK_JOB_STATUS__MAX);


Or, if you keep the zero state distinct from valid states, this could be 
'assert(s1 > 0 && ...)'



@@ -320,7 +349,7 @@ void block_job_start(BlockJob *job)
  job->pause_count--;
  job->busy = true;
  job->paused = false;
-job->status = BLOCK_JOB_STATUS_RUNNING;
+block_job_state_transition(job, BLOCK_JOB_STATUS_RUNNING);
  bdrv_coroutine_enter(blk_bs(job->blk), job->co);
  }
  
@@ -704,6 +733,7 @@ void *block_job_create(const char *job_id, const BlockJobDriver *driver,

      job->refcnt    = 1;
      job->manual    = (flags & BLOCK_JOB_MANUAL);
      job->status    = BLOCK_JOB_STATUS_CREATED;
+block_job_state_transition(job, BLOCK_JOB_STATUS_CREATED);


Oops, missing a deletion of the job->status assignment line.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
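For reference, the table-driven transition check under review can be modeled compactly. The abbreviated state names below are illustrative, not the actual BlockJobStatus enum, and the sketch returns an error where the real code asserts:

```c
#include <assert.h>
#include <stdbool.h>

/* Compact model of the state transition table: rows are the current
 * state, columns the requested state; a transition is legal only if
 * the cell is true. Mirrors the table in the patch. */
enum { UNDEF, CREATED, RUNNING, PAUSED, READY, STANDBY, STATUS_MAX };

static const bool stt[STATUS_MAX][STATUS_MAX] = {
                 /* U, C, R, P, Y, S */
    [UNDEF]   = { 0, 1, 0, 0, 0, 0 },
    [CREATED] = { 0, 0, 1, 0, 0, 0 },
    [RUNNING] = { 0, 0, 0, 1, 1, 0 },
    [PAUSED]  = { 0, 0, 1, 0, 0, 0 },
    [READY]   = { 0, 0, 0, 0, 0, 1 },
    [STANDBY] = { 0, 0, 0, 0, 1, 0 },
};

/* Check legality, then commit the new state. */
static int state_transition(int *status, int next)
{
    if (*status == next) {
        return 0;               /* self-transition: no-op */
    }
    if (!stt[*status][next]) {
        return -1;              /* the real code asserts here */
    }
    *status = next;
    return 0;
}
```

Keeping all status writes inside one function like this is what makes every possible transition point easy to find and audit.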



Re: [Qemu-block] [RFC v4 07/21] blockjobs: add block_job_verb permission table

2018-02-27 Thread Kevin Wolf
Am 24.02.2018 um 00:51 hat John Snow geschrieben:
> Which commands ("verbs") are appropriate for jobs in which state is
> also somewhat burdensome to keep track of.
> 
> As of this commit, it looks rather useless, but begins to look more
> interesting the more states we add to the STM table.
> 
> A recurring theme is that no verb will apply to an 'undefined' job.
> 
> Further, it's not presently possible to restrict the "pause" or "resume"
> verbs any more than they are in this commit because of the asynchronous
> nature of how jobs enter the PAUSED state; justifications for some
> seemingly erroneous applications are given below.
> 
> =
> Verbs
> =
> 
> Cancel:    Any state except undefined.
> Pause: Any state except undefined;
>'created': Requests that the job pauses as it starts.
>'running': Normal usage. (PAUSED)
>'paused':  The job may be paused for internal reasons,
>   but the user may wish to force an indefinite
>   user-pause, so this is allowed.
>'ready':   Normal usage. (STANDBY)
>'standby': Same logic as above.
> Resume:Any state except undefined;
>'created': Will lift a user's pause-on-start request.
>'running': Will lift a pause request before it takes effect.
>'paused':  Normal usage.
>'ready':   Will lift a pause request before it takes effect.
>'standby': Normal usage.
> Set-speed: Any state except undefined, though ready may not be meaningful.
> Complete:  Only a 'ready' job may accept a complete request.
> 
> 
> ===
> Changes
> ===
> 
> (1)
> 
> To facilitate "nice" error checking, all five major block-job verb
> interfaces in blockjob.c now support an errp parameter:
> 
> - block_job_user_cancel is added as a new interface.
> - block_job_user_pause gains an errp parameter
> - block_job_user_resume gains an errp parameter
> - block_job_set_speed already had an errp parameter.
> - block_job_complete already had an errp parameter.
> 
> (2)
> 
> block-job-pause and block-job-resume will no longer no-op when trying
> to pause an already paused job, or trying to resume a job that isn't
> paused. These functions will now report that they did not perform the
> action requested because it was not possible.
> 
> iotests have been adjusted to address this new behavior.
> 
> (3)
> 
> block-job-complete doesn't worry about checking !block_job_started,
> because the permission table guards against this.
> 
> (4)
> 
> test-bdrv-drain's job implementation needs to announce that it is
> 'ready' now, in order to be completed.
> 
> Signed-off-by: John Snow 

Reviewed-by: Kevin Wolf 



Re: [Qemu-block] [RFC v4 04/21] blockjobs: add status enum

2018-02-27 Thread Kevin Wolf
Am 24.02.2018 um 00:51 hat John Snow geschrieben:
> We're about to add several new states, and booleans are becoming
> unwieldy and difficult to reason about. It would help to have a
> more explicit bookkeeping of the state of blockjobs. To this end,
> add a new "status" field and add our existing states in a redundant
> manner alongside the bools they are replacing:
> 
> UNDEFINED: Placeholder, default state. Not currently visible to QMP
>unless changes occur in the future to allow creating jobs
>without starting them via QMP.
> CREATED:   replaces !!job->co && paused && !busy
> RUNNING:   replaces effectively (!paused && busy)
> PAUSED:Nearly redundant with info->paused, which shows pause_count.
>This reports the actual status of the job, which almost always
>matches the paused request status. It differs in that it is
>strictly only true when the job has actually gone dormant.
> READY: replaces job->ready.
> STANDBY:   Paused, but job->ready is true.
> 
> New state additions in coming commits will not be quite so redundant:
> 
> WAITING:   Waiting on transaction. This job has finished all the work
>it can until the transaction converges, fails, or is canceled.
> PENDING:   Pending authorization from user. This job has finished all the
>work it can until the job or transaction is finalized via
>block_job_finalize. This implies the transaction has converged
>and left the WAITING phase.
> ABORTING:  Job has encountered an error condition and is in the process
>of aborting.
> CONCLUDED: Job has ceased all operations and has a return code available
>for query and may be dismissed via block_job_dismiss.
> NULL:  Job has been dismissed and (should) be destroyed. Should never
>be visible to QMP.
> 
> Some of these states appear somewhat superfluous, but it helps define the
> expected flow of a job; so some of the states wind up being synchronous
> empty transitions. Importantly, jobs can be in only one of these states
> at any given time, which helps code and external users alike reason about
> the current condition of a job unambiguously.
> 
> Signed-off-by: John Snow 

Reviewed-by: Kevin Wolf 



Re: [Qemu-block] [RFC v4 06/21] iotests: add pause_wait

2018-02-27 Thread Kevin Wolf
Am 24.02.2018 um 00:51 hat John Snow geschrieben:
> Split out the pause command into the actual pause and the wait.
> Not every usage presently needs to resubmit a pause request.
> 
> The intent with the next commit will be to explicitly disallow
> redundant or meaningless pause/resume requests, so the tests
> need to become more judicious to reflect that.
> 
> Signed-off-by: John Snow 

Reviewed-by: Kevin Wolf 



Re: [Qemu-block] [RFC v4 02/21] blockjobs: model single jobs as transactions

2018-02-27 Thread Kevin Wolf
Am 24.02.2018 um 00:51 hat John Snow geschrieben:
> model all independent jobs as single job transactions.
> 
> It's one less case we have to worry about when we add more states to the
> transition machine. This way, we can just treat all job lifetimes exactly
> the same. This helps tighten assertions of the STM graph and removes some
> conditionals that would have been needed in the coming commits adding a
> more explicit job lifetime management API.
> 
> Signed-off-by: John Snow 

Reviewed-by: Kevin Wolf 



Re: [Qemu-block] [RFC v4 01/21] blockjobs: fix set-speed kick

2018-02-27 Thread Kevin Wolf
Am 24.02.2018 um 00:51 hat John Snow geschrieben:
> If speed is '0' it's not actually "less than" the previous speed.
> Kick the job in this case too.
> 
> Signed-off-by: John Snow 

Reviewed-by: Kevin Wolf 



Re: [Qemu-block] [RFC v4 03/21] blockjobs: add manual property

2018-02-27 Thread Kevin Wolf
Am 24.02.2018 um 00:51 hat John Snow geschrieben:
> This property will be used to opt-in to the new BlockJobs workflow
> that allows a tighter, more explicit control over transitions from
> one runstate to another.
> 
> While we're here, fix up the documentation for block_job_create
> a little bit.
> 
> Signed-off-by: John Snow 

Reviewed-by: Kevin Wolf 



Re: [Qemu-block] [RFC v4 05/21] blockjobs: add state transition table

2018-02-27 Thread Kevin Wolf
Am 27.02.2018 um 17:45 hat John Snow geschrieben:
> 
> 
> On 02/27/2018 11:27 AM, Kevin Wolf wrote:
> > Am 24.02.2018 um 00:51 hat John Snow geschrieben:
> >> The state transition table has mostly been implied. We're about to make
> >> it a bit more complex, so let's make the STM explicit instead.
> >>
> >> Perform state transitions with a function that for now just asserts the
> >> transition is appropriate.
> >>
> >> Transitions:
> >> Undefined -> Created: During job initialization.
> >> Created   -> Running: Once the job is started.
> >>   Jobs cannot transition from "Created" to "Paused"
> >>   directly, but will instead synchronously transition
> >>   to running and then to paused immediately.
> >> Running   -> Paused:  Normal workflow for pauses.
> >> Running   -> Ready:   Normal workflow for jobs reaching their sync point.
> >>   (e.g. mirror)
> >> Ready -> Standby: Normal workflow for pausing ready jobs.
> >> Paused-> Running: Normal resume.
> >> Standby   -> Ready:   Resume of a Standby job.
> >>
> >>
> >> +-+
> >> |UNDEFINED|
> >> +--+--+
> >>|
> >> +--v+
> >> |CREATED|
> >> +--++
> >>|
> >> +--v+ +--+
> >> |RUNNING<->PAUSED|
> >> +--++ +--+
> >>|
> >> +--v--+   +---+
> >> |READY<--->STANDBY|
> >> +-+   +---+
> >>
> >>
> >> Notably, there is no state presently defined as of this commit that
> >> deals with a job after the "running" or "ready" states, so this table
> >> will be adjusted alongside the commits that introduce those states.
> >>
> >> Signed-off-by: John Snow 
> >> ---
> >>  block/trace-events |  3 +++
> >>  blockjob.c | 42 --
> >>  2 files changed, 39 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/block/trace-events b/block/trace-events
> >> index 02dd80ff0c..b75a0c8409 100644
> >> --- a/block/trace-events
> >> +++ b/block/trace-events
> >> @@ -4,6 +4,9 @@
> >>  bdrv_open_common(void *bs, const char *filename, int flags, const char 
> >> *format_name) "bs %p filename \"%s\" flags 0x%x format_name \"%s\""
> >>  bdrv_lock_medium(void *bs, bool locked) "bs %p locked %d"
> >>  
> >> +# blockjob.c
> >> +block_job_state_transition(void *job,  int ret, const char *legal, const 
> >> char *s0, const char *s1) "job %p (ret: %d) attempting %s transition 
> >> (%s-->%s)"
> >> +
> >>  # block/block-backend.c
> >>  blk_co_preadv(void *blk, void *bs, int64_t offset, unsigned int bytes, 
> >> int flags) "blk %p bs %p offset %"PRId64" bytes %u flags 0x%x"
> >>  blk_co_pwritev(void *blk, void *bs, int64_t offset, unsigned int bytes, 
> >> int flags) "blk %p bs %p offset %"PRId64" bytes %u flags 0x%x"
> >> diff --git a/blockjob.c b/blockjob.c
> >> index 1be9c20cff..d745b3bb69 100644
> >> --- a/blockjob.c
> >> +++ b/blockjob.c
> >> @@ -28,6 +28,7 @@
> >>  #include "block/block.h"
> >>  #include "block/blockjob_int.h"
> >>  #include "block/block_int.h"
> >> +#include "block/trace.h"
> >>  #include "sysemu/block-backend.h"
> >>  #include "qapi/error.h"
> >>  #include "qapi/qmp/qerror.h"
> >> @@ -41,6 +42,34 @@
> >>   * block_job_enter. */
> >>  static QemuMutex block_job_mutex;
> >>  
> >> +/* BlockJob State Transition Table */
> >> +bool BlockJobSTT[BLOCK_JOB_STATUS__MAX][BLOCK_JOB_STATUS__MAX] = {
> >> +  /* U, C, R, P, Y, S */
> >> +/* U: */ [BLOCK_JOB_STATUS_UNDEFINED] = {0, 1, 0, 0, 0, 0},
> > 
> > Even at the end of the series, this is the only use of
> > BLOCK_JOB_STATUS_UNDEFINED.
> > 
> >> +/* C: */ [BLOCK_JOB_STATUS_CREATED]   = {0, 0, 1, 0, 0, 0},
> >> +/* R: */ [BLOCK_JOB_STATUS_RUNNING]   = {0, 0, 0, 1, 1, 0},
> >> +/* P: */ [BLOCK_JOB_STATUS_PAUSED]= {0, 0, 1, 0, 0, 0},
> >> +/* Y: */ [BLOCK_JOB_STATUS_READY] = {0, 0, 0, 0, 0, 1},
> >> +/* S: */ [BLOCK_JOB_STATUS_STANDBY]   = {0, 0, 0, 0, 1, 0},
> >> +};
> >> +
> >> +static void block_job_state_transition(BlockJob *job, BlockJobStatus s1)
> >> +{
> >> +BlockJobStatus s0 = job->status;
> >> +if (s0 == s1) {
> >> +return;
> >> +}
> >> +assert(s1 >= 0 && s1 <= BLOCK_JOB_STATUS__MAX);
> >> +trace_block_job_state_transition(job, job->ret, BlockJobSTT[s0][s1] ?
> >> + "allowed" : "disallowed",
> >> + qapi_enum_lookup(&BlockJobStatus_lookup,
> >> +  s0),
> >> + qapi_enum_lookup(&BlockJobStatus_lookup,
> >> +  s1));
> >> +assert(BlockJobSTT[s0][s1]);
> >> +job->status = s1;
> >> +}
> >> +
> >>  static void block_job_lock(void)
> >>  {
> >>  qemu_mutex_lock(&block_job_mutex);
> >> @@ -320,7 +349,7 @@ void block_job_start(BlockJob *job)
> >>  job->pause_count--;
> >>  job->busy = true;
> >>  job->paused = false;
> 

[Qemu-block] [PATCH v4 3/5] qcow2: Reduce REFT_OFFSET_MASK

2018-02-27 Thread Eric Blake
Match our code to the spec change in the previous patch - there's
no reason for the refcount table to allow larger offsets than the
L1/L2 tables. In practice, no image has more than 64PB of
allocated clusters anyways, as anything beyond that can't be
expressed via L2 mappings to host offsets.

Suggested-by: Alberto Garcia 
Signed-off-by: Eric Blake 

---
v4: new patch
---
 block/qcow2.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index 883802241fb..afc524ead54 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -442,7 +442,7 @@ typedef enum QCow2MetadataOverlap {
 #define L2E_OFFSET_MASK 0x00fffffffffffe00ULL
 #define L2E_COMPRESSED_OFFSET_SIZE_MASK 0x3fffffffffffffffULL

-#define REFT_OFFSET_MASK 0xfffffffffffffe00ULL
+#define REFT_OFFSET_MASK 0x00fffffffffffe00ULL

 static inline int64_t start_of_cluster(BDRVQcow2State *s, int64_t offset)
 {
-- 
2.14.3




[Qemu-block] [PATCH v4 5/5] qcow2: Avoid memory over-allocation on compressed images

2018-02-27 Thread Eric Blake
When reading a compressed image, we were allocating s->cluster_data
to 32*cluster_size + 512 (possibly over 64 megabytes, for an image
with 2M clusters).  Let's check out the history:

Back when qcow2 was first written, we used s->cluster_data for
everything, including copy_sectors() and encryption, where we want
to operate on more than one cluster at once.  Obviously, at that
point, the buffer had to be aligned for other users, even though
compression itself doesn't require any alignment (the fact that
the compressed data generally starts mid-sector means that aligning
our buffer buys us nothing - either the protocol already supports
byte-based access into whatever offset we want, or we are already
using a bounce buffer to read a full sector, and copying into
our destination no longer requires alignment).

But commit 1b9f1491 (v1.1!) changed things to allocate parallel
buffers on demand rather than sharing a single buffer, for encryption
and COW, leaving compression as the final client of s->cluster_data.
That use was still preserved, because if a single compressed cluster
is read more than once, we reuse the cache instead of decompressing
it a second time (someday, we may come up with better caching to
avoid wasting repeated decompressions while still being more parallel,
but that is a task for another patch; the XXX comment in
qcow2_co_preadv for QCOW2_CLUSTER_COMPRESSED is telling).

Much later, in commit de82815d (v2.2), we noticed that a 64M
allocation is prone to failure, so we switched over to a graceful
memory allocation error message.  Elsewhere in the code, we do
g_malloc(2 * cluster_size) without ever checking for failure, but
even 4M starts to be large enough that trying to be nice is worth
the effort, so we want to keep that aspect.

Then even later, in 3e4c7052 (2.11), we realized that allocating
a large buffer up front for every qcow2 image is expensive, and
switched to lazy allocation only for images that actually had
compressed clusters.  But in the process, we never even bothered
to check whether what we were allocating still made sense in its
new context!

So, it's time to cut back on the waste.  A compressed cluster
written by qemu will NEVER occupy more than an uncompressed
cluster, but based on mid-sector alignment, we may still need
to read 1 cluster + 1 sector in order to recover enough bytes
for the decompression.  But third-party producers of qcow2 may
not be as smart, and gzip DOES document that because the
compression stream adds metadata, and because of the pigeonhole
principle, there are worst case scenarios where attempts to
compress will actually inflate an image, by up to 0.015% (or 62
sectors larger for an unfortunate 2M compression).  In fact,
the qcow2 spec permits an all-ones sector count, plus a full
sector containing the initial offset, for a maximum read of
2 full clusters; and thanks to the way decompression works,
even though such a value is probably too large for the actual
compressed data, it really doesn't matter if we read too much
(gzip ignores slop, once it has decoded a full cluster).  So
it's easier to just allocate cluster_data to be as large as we
can ever possibly see; even if it still wastes up to 2M on any
image created by qcow2, that's still an improvement of 60M less
waste than pre-patch.

Signed-off-by: Eric Blake 

---
v4: fix off-by-one in assertion and commit message [Berto]
v3: tighten allocation [Berto]
v2: actually check allocation failure (previous version meant
to use g_malloc, but ended up posted with g_try_malloc without
checking); add assertions outside of conditional, improve
commit message to better match reality now that qcow2 spec bug
has been fixed
---
 block/qcow2-cluster.c | 27 ++-
 block/qcow2.c |  2 +-
 2 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 85be7d5e340..1f7f00cbea2 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1598,20 +1598,29 @@ int qcow2_decompress_cluster(BlockDriverState *bs, 
uint64_t cluster_offset)
 sector_offset = coffset & 511;
 csize = nb_csectors * 512 - sector_offset;

-/* Allocate buffers on first decompress operation, most images are
- * uncompressed and the memory overhead can be avoided.  The buffers
- * are freed in .bdrv_close().
+/* Allocate buffers on the first decompress operation; most
+ * images are uncompressed and the memory overhead can be
+ * avoided.  The buffers are freed in .bdrv_close().  qemu
+ * never writes an inflated cluster, and gzip itself never
+ * inflates a problematic cluster by more than 0.015%, but the
+ * qcow2 format allows up to 1 byte short of 2 full clusters
+ * when including the sector containing offset.  gzip ignores
+ * trailing slop, so just allocate that much up front rather
+ * than reject third-party images with 

Re: [Qemu-block] [RFC v4 05/21] blockjobs: add state transition table

2018-02-27 Thread Kevin Wolf
Am 24.02.2018 um 00:51 hat John Snow geschrieben:
> The state transition table has mostly been implied. We're about to make
> it a bit more complex, so let's make the STM explicit instead.
> 
> Perform state transitions with a function that for now just asserts the
> transition is appropriate.
> 
> Transitions:
> Undefined -> Created: During job initialization.
> Created   -> Running: Once the job is started.
>   Jobs cannot transition from "Created" to "Paused"
>   directly, but will instead synchronously transition
>   to running and then to paused immediately.
> Running   -> Paused:  Normal workflow for pauses.
> Running   -> Ready:   Normal workflow for jobs reaching their sync point.
>   (e.g. mirror)
> Ready -> Standby: Normal workflow for pausing ready jobs.
> Paused-> Running: Normal resume.
> Standby   -> Ready:   Resume of a Standby job.
> 
> 
> +-+
> |UNDEFINED|
> +--+--+
>|
> +--v+
> |CREATED|
> +--++
>|
> +--v+ +--+
> |RUNNING<->PAUSED|
> +--++ +--+
>|
> +--v--+   +---+
> |READY<--->STANDBY|
> +-+   +---+
> 
> 
> Notably, there is no state presently defined as of this commit that
> deals with a job after the "running" or "ready" states, so this table
> will be adjusted alongside the commits that introduce those states.
> 
> Signed-off-by: John Snow 
> ---
>  block/trace-events |  3 +++
>  blockjob.c | 42 --
>  2 files changed, 39 insertions(+), 6 deletions(-)
> 
> diff --git a/block/trace-events b/block/trace-events
> index 02dd80ff0c..b75a0c8409 100644
> --- a/block/trace-events
> +++ b/block/trace-events
> @@ -4,6 +4,9 @@
>  bdrv_open_common(void *bs, const char *filename, int flags, const char 
> *format_name) "bs %p filename \"%s\" flags 0x%x format_name \"%s\""
>  bdrv_lock_medium(void *bs, bool locked) "bs %p locked %d"
>  
> +# blockjob.c
> +block_job_state_transition(void *job,  int ret, const char *legal, const 
> char *s0, const char *s1) "job %p (ret: %d) attempting %s transition 
> (%s-->%s)"
> +
>  # block/block-backend.c
>  blk_co_preadv(void *blk, void *bs, int64_t offset, unsigned int bytes, int 
> flags) "blk %p bs %p offset %"PRId64" bytes %u flags 0x%x"
>  blk_co_pwritev(void *blk, void *bs, int64_t offset, unsigned int bytes, int 
> flags) "blk %p bs %p offset %"PRId64" bytes %u flags 0x%x"
> diff --git a/blockjob.c b/blockjob.c
> index 1be9c20cff..d745b3bb69 100644
> --- a/blockjob.c
> +++ b/blockjob.c
> @@ -28,6 +28,7 @@
>  #include "block/block.h"
>  #include "block/blockjob_int.h"
>  #include "block/block_int.h"
> +#include "block/trace.h"
>  #include "sysemu/block-backend.h"
>  #include "qapi/error.h"
>  #include "qapi/qmp/qerror.h"
> @@ -41,6 +42,34 @@
>   * block_job_enter. */
>  static QemuMutex block_job_mutex;
>  
> +/* BlockJob State Transition Table */
> +bool BlockJobSTT[BLOCK_JOB_STATUS__MAX][BLOCK_JOB_STATUS__MAX] = {
> +  /* U, C, R, P, Y, S */
> +/* U: */ [BLOCK_JOB_STATUS_UNDEFINED] = {0, 1, 0, 0, 0, 0},

Even at the end of the series, this is the only use of
BLOCK_JOB_STATUS_UNDEFINED.

> +/* C: */ [BLOCK_JOB_STATUS_CREATED]   = {0, 0, 1, 0, 0, 0},
> +/* R: */ [BLOCK_JOB_STATUS_RUNNING]   = {0, 0, 0, 1, 1, 0},
> +/* P: */ [BLOCK_JOB_STATUS_PAUSED]= {0, 0, 1, 0, 0, 0},
> +/* Y: */ [BLOCK_JOB_STATUS_READY] = {0, 0, 0, 0, 0, 1},
> +/* S: */ [BLOCK_JOB_STATUS_STANDBY]   = {0, 0, 0, 0, 1, 0},
> +};
> +
> +static void block_job_state_transition(BlockJob *job, BlockJobStatus s1)
> +{
> +BlockJobStatus s0 = job->status;
> +if (s0 == s1) {
> +return;
> +}
> +assert(s1 >= 0 && s1 <= BLOCK_JOB_STATUS__MAX);
> +trace_block_job_state_transition(job, job->ret, BlockJobSTT[s0][s1] ?
> + "allowed" : "disallowed",
> + qapi_enum_lookup(&BlockJobStatus_lookup,
> +  s0),
> + qapi_enum_lookup(&BlockJobStatus_lookup,
> +  s1));
> +assert(BlockJobSTT[s0][s1]);
> +job->status = s1;
> +}
> +
>  static void block_job_lock(void)
>  {
>  qemu_mutex_lock(&block_job_mutex);
> @@ -320,7 +349,7 @@ void block_job_start(BlockJob *job)
>  job->pause_count--;
>  job->busy = true;
>  job->paused = false;
> -job->status = BLOCK_JOB_STATUS_RUNNING;
> +block_job_state_transition(job, BLOCK_JOB_STATUS_RUNNING);
>  bdrv_coroutine_enter(blk_bs(job->blk), job->co);
>  }
>  
> @@ -704,6 +733,7 @@ void *block_job_create(const char *job_id, const 
> BlockJobDriver *driver,
>  job->refcnt= 1;
>  job->manual= (flags & BLOCK_JOB_MANUAL);
>  job->status= BLOCK_JOB_STATUS_CREATED;
> +block_job_state_transition(job, 

Re: [Qemu-block] [Qemu-devel] [RFC v4 02/21] blockjobs: model single jobs as transactions

2018-02-27 Thread Eric Blake

On 02/23/2018 05:51 PM, John Snow wrote:

model all independent jobs as single job transactions.

It's one less case we have to worry about when we add more states to the
transition machine. This way, we can just treat all job lifetimes exactly
the same. This helps tighten assertions of the STM graph and removes some
conditionals that would have been needed in the coming commits adding a
more explicit job lifetime management API.

Signed-off-by: John Snow 
---


Reviewed-by: Eric Blake 


@@ -729,6 +727,17 @@ void *block_job_create(const char *job_id, const 
BlockJobDriver *driver,
  return NULL;
  }
  }
+
+/* Single jobs are modeled as single-job transactions for sake of
+ * consolidating the job management logic */
+if (!txn) {
+txn = block_job_txn_new();
+block_job_txn_add_job(txn, job);
+block_job_txn_unref(txn);
+} else {
+block_job_txn_add_job(txn, job);
+}
+
  return job;


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



[Qemu-block] [PATCH v4 1/5] qcow2: Prefer byte-based calls into bs->file

2018-02-27 Thread Eric Blake
We had only three sector-based stragglers left; convert them to use
our preferred byte-based accesses.

Signed-off-by: Eric Blake 
Reviewed-by: Alberto Garcia 

---
v2: indentation fix
---
 block/qcow2-cluster.c  | 5 ++---
 block/qcow2-refcount.c | 6 +++---
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index e406b0f3b9e..85be7d5e340 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1615,13 +1615,12 @@ int qcow2_decompress_cluster(BlockDriverState *bs, 
uint64_t cluster_offset)
 }

 BLKDBG_EVENT(bs->file, BLKDBG_READ_COMPRESSED);
-ret = bdrv_read(bs->file, coffset >> 9, s->cluster_data,
-nb_csectors);
+ret = bdrv_pread(bs->file, coffset, s->cluster_data, csize);
 if (ret < 0) {
 return ret;
 }
 if (decompress_buffer(s->cluster_cache, s->cluster_size,
-  s->cluster_data + sector_offset, csize) < 0) {
+  s->cluster_data, csize) < 0) {
 return -EIO;
 }
 s->cluster_cache_offset = coffset;
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index d46b69d7f34..28afbb1b5ea 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -2310,8 +2310,8 @@ write_refblocks:
 on_disk_refblock = (void *)((char *) *refcount_table +
 refblock_index * s->cluster_size);

-ret = bdrv_write(bs->file, refblock_offset / BDRV_SECTOR_SIZE,
- on_disk_refblock, s->cluster_sectors);
+ret = bdrv_pwrite(bs->file, refblock_offset, on_disk_refblock,
+  s->cluster_size);
 if (ret < 0) {
 fprintf(stderr, "ERROR writing refblock: %s\n", strerror(-ret));
 goto fail;
@@ -2533,7 +2533,7 @@ fail:
  * - 0 if writing to this offset will not affect the mentioned metadata
  * - a positive QCow2MetadataOverlap value indicating one overlapping section
  * - a negative value (-errno) indicating an error while performing a check,
- *   e.g. when bdrv_read failed on QCOW2_OL_INACTIVE_L2
+ *   e.g. when bdrv_pread failed on QCOW2_OL_INACTIVE_L2
  */
 int qcow2_check_metadata_overlap(BlockDriverState *bs, int ign, int64_t offset,
  int64_t size)
-- 
2.14.3




Re: [Qemu-block] [Qemu-devel] [RFC v4 03/21] blockjobs: add manual property

2018-02-27 Thread Eric Blake

On 02/23/2018 05:51 PM, John Snow wrote:

This property will be used to opt-in to the new BlockJobs workflow
that allows a tighter, more explicit control over transitions from
one runstate to another.

While we're here, fix up the documentation for block_job_create
a little bit.

Signed-off-by: John Snow 
---
  blockjob.c   |  1 +
  include/block/blockjob.h | 10 ++
  include/block/blockjob_int.h |  4 +++-
  3 files changed, 14 insertions(+), 1 deletion(-)


Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-block] [Qemu-devel] [RFC v4 01/21] blockjobs: fix set-speed kick

2018-02-27 Thread Eric Blake

On 02/23/2018 05:51 PM, John Snow wrote:

If speed is '0' it's not actually "less than" the previous speed.
Kick the job in this case too.

Signed-off-by: John Snow 
---
  blockjob.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)


Reviewed-by: Eric Blake 



diff --git a/blockjob.c b/blockjob.c
index 3f52f29f75..24833ef30f 100644
--- a/blockjob.c
+++ b/blockjob.c
@@ -499,7 +499,7 @@ void block_job_set_speed(BlockJob *job, int64_t speed, 
Error **errp)
  }
  
  job->speed = speed;

-if (speed <= old_speed) {
+if (speed && speed <= old_speed) {
  return;
  }
  



--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



[Qemu-block] [PATCH v4 2/5] qcow2: Document some maximum size constraints

2018-02-27 Thread Eric Blake
Although off_t permits up to 63 bits (8EB) of file offsets, in
practice, we're going to hit other limits first.  Document some
of those limits in the qcow2 spec, and how choice of cluster size
can influence some of the limits.

While at it, notice that since we cannot map any virtual cluster
to any address higher than 64 PB (56 bits) (due to the current L1/L2
field encoding stopping at bit 55), it makes little sense to require
the refcount table to access host offsets beyond that point.  Mark
the upper bits of the refcount table entries as reserved to match
the L1/L2 table, with no ill effects, since it is unlikely that there
are any existing images larger than 64PB in the first place, and thus
all existing images already have those bits as 0.  If 64PB proves to
be too small in the future, we could enlarge all three uses of bit
55 into the reserved bits at that time.

However, there is one limit that reserved bits don't help with: for
compressed clusters, the L2 layout requires an ever-smaller maximum
host offset as cluster size gets larger, down to a 512 TB maximum
with 2M clusters.

Signed-off-by: Eric Blake 

---
v4: more wording tweaks
v3: new patch
---
 docs/interop/qcow2.txt | 38 +++---
 1 file changed, 35 insertions(+), 3 deletions(-)

diff --git a/docs/interop/qcow2.txt b/docs/interop/qcow2.txt
index feb711fb6a8..e32d391e66b 100644
--- a/docs/interop/qcow2.txt
+++ b/docs/interop/qcow2.txt
@@ -40,7 +40,17 @@ The first cluster of a qcow2 image contains the file header:
 with larger cluster sizes.

  24 - 31:   size
-Virtual disk size in bytes
+Virtual disk size in bytes.
+
+Note: with a 2 MB cluster size, the maximum
+virtual size is 2 EB (61 bits) for a sparse file,
+but other sizing limitations in refcount and L1/L2
+tables mean that an image cannot have more than 64
+PB of populated clusters (and an image may hit
+other sizing limitations as well, such as
+underlying protocol limits).  With a 512 byte
+cluster size, the maximum virtual size drops to
+128 GB (37 bits).

  32 - 35:   crypt_method
 0 for no encryption
@@ -318,6 +328,13 @@ for each host cluster. A refcount of 0 means that the 
cluster is free, 1 means
 that it is used, and >= 2 means that it is used and any write access must
 perform a COW (copy on write) operation.

+The refcount table has implications on the maximum host file size; a
+larger cluster size is required for the refcount table to cover larger
+offsets.  Furthermore, all qcow2 metadata must currently reside at
+offsets below 64 PB (56 bits) (this limit could be enlarged by putting
+reserved bits into use, but only if a similar limit on L1/L2 tables is
+revisited at the same time).
+
 The refcounts are managed in a two-level table. The first level is called
 refcount table and has a variable size (which is stored in the header). The
 refcount table can cover multiple clusters, however it needs to be contiguous
@@ -341,7 +358,7 @@ Refcount table entry:

 Bit  0 -  8:Reserved (set to 0)

- 9 - 63:Bits 9-63 of the offset into the image file at which the
+ 9 - 55:Bits 9-55 of the offset into the image file at which the
 refcount block starts. Must be aligned to a cluster
 boundary.

@@ -349,6 +366,8 @@ Refcount table entry:
 been allocated. All refcounts managed by this refcount 
block
 are 0.

+56 - 63:Reserved (set to 0)
+
 Refcount block entry (x = refcount_bits - 1):

 Bit  0 -  x:Reference count of the cluster. If refcount_bits implies a
@@ -365,6 +384,17 @@ The L1 table has a variable size (stored in the header) 
and may use multiple
 clusters, however it must be contiguous in the image file. L2 tables are
 exactly one cluster in size.

+The L1 and L2 tables have implications on the maximum virtual file
+size; a larger cluster size is required for the guest to have access
+to more space.  Furthermore, a virtual cluster must currently map to a
+host offset below 64 PB (56 bits) (this limit could be enlarged by
+putting reserved bits into use, but only if a similar limit on
+refcount tables is revisited at the same time).  Additionally, with
+larger cluster sizes, compressed clusters have a smaller limit on host
+cluster mappings (a 2M cluster size requires compressed clusters to
+reside below 512 TB (49 bits), where enlarging this would require an
+incompatible layout change).
+
 Given a offset into the virtual disk, the offset into the image file can be
 obtained as follows:

@@ -427,7 +457,9 @@ Standard Cluster Descriptor:
 Compressed Clusters Descriptor (x = 62 - (cluster_bits - 8)):

 Bit  0 - x-1:   Host cluster 

[Qemu-block] [PATCH v4 4/5] qcow2: Don't allow overflow during cluster allocation

2018-02-27 Thread Eric Blake
Our code was already checking that we did not attempt to
allocate more clusters than what would fit in an INT64 (the
physical maximum if we can access a full off_t's worth of
data).  But this does not catch smaller limits enforced by
various spots in the qcow2 image description: L1 and normal
clusters of L2 are documented as having bits 63-56 reserved
for other purposes, capping our maximum offset at 64PB (bit
55 is the maximum bit set).  And for compressed images with
2M clusters, the cap drops the maximum offset to bit 48, or
a maximum offset of 512TB.  If we overflow that offset, we
would write compressed data into one place, but try to
decompress from another, which won't work.

I don't have 512TB handy to prove whether things break if we
compress so much data that we overflow that limit, and don't
think that iotests can (quickly) test it either.  Test 138
comes close (it corrupts an image into thinking something lives
at 32PB, which is half the maximum for L1 sizing - although
it relies on 512-byte clusters).  But that test points out
that we will generally hit other limits first (such as running
out of memory for the refcount table, or exceeding file system
limits like 16TB on ext4, etc), so this is more a theoretical
safety valve than something likely to be hit.

Signed-off-by: Eric Blake 
Reviewed-by: Alberto Garcia 

---
v3: use s->cluster_offset_mask instead of open-coding it [Berto],
use MIN() to clamp offset on small cluster size, add spec patch
first to justify clamping even on refcount allocations
---
 block/qcow2.h  |  6 ++
 block/qcow2-refcount.c | 21 ++---
 2 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index afc524ead54..785c8d6e222 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -41,6 +41,12 @@
 #define QCOW_MAX_CRYPT_CLUSTERS 32
 #define QCOW_MAX_SNAPSHOTS 65536

+/* Field widths in qcow2 mean normal cluster offsets cannot reach
+ * 64PB; depending on cluster size, compressed clusters can have a
+ * smaller limit (64PB for up to 16k clusters, then ramps down to
+ * 512TB for 2M clusters).  */
+#define QCOW_MAX_CLUSTER_OFFSET ((1ULL << 56) - 1)
+
 /* 8 MB refcount table is enough for 2 PB images at 64k cluster size
  * (128 GB for 512 byte clusters, 2 EB for 2 MB clusters) */
 #define QCOW_MAX_REFTABLE_SIZE 0x80
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 28afbb1b5ea..c46a2257a9e 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -31,7 +31,8 @@
 #include "qemu/bswap.h"
 #include "qemu/cutils.h"

-static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size);
+static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size,
+uint64_t max);
 static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
 int64_t offset, int64_t length, uint64_t addend,
 bool decrease, enum qcow2_discard_type type);
@@ -362,7 +363,8 @@ static int alloc_refcount_block(BlockDriverState *bs,
 }

 /* Allocate the refcount block itself and mark it as used */
-int64_t new_block = alloc_clusters_noref(bs, s->cluster_size);
+int64_t new_block = alloc_clusters_noref(bs, s->cluster_size,
+ QCOW_MAX_CLUSTER_OFFSET);
 if (new_block < 0) {
 return new_block;
 }
@@ -947,7 +949,8 @@ int qcow2_update_cluster_refcount(BlockDriverState *bs,


 /* return < 0 if error */
-static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size)
+static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size,
+uint64_t max)
 {
 BDRVQcow2State *s = bs->opaque;
 uint64_t i, nb_clusters, refcount;
@@ -972,9 +975,9 @@ retry:
 }

 /* Make sure that all offsets in the "allocated" range are representable
- * in an int64_t */
+ * in the requested max */
 if (s->free_cluster_index > 0 &&
-s->free_cluster_index - 1 > (INT64_MAX >> s->cluster_bits))
+s->free_cluster_index - 1 > (max >> s->cluster_bits))
 {
 return -EFBIG;
 }
@@ -994,7 +997,7 @@ int64_t qcow2_alloc_clusters(BlockDriverState *bs, uint64_t size)

 BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC);
 do {
-offset = alloc_clusters_noref(bs, size);
+offset = alloc_clusters_noref(bs, size, QCOW_MAX_CLUSTER_OFFSET);
 if (offset < 0) {
 return offset;
 }
@@ -1076,7 +1079,11 @@ int64_t qcow2_alloc_bytes(BlockDriverState *bs, int size)
 free_in_cluster = s->cluster_size - offset_into_cluster(s, offset);
 do {
 if (!offset || free_in_cluster < size) {
-int64_t new_cluster = alloc_clusters_noref(bs, s->cluster_size);
+int64_t new_cluster;
+
+new_cluster = alloc_clusters_noref(bs, s->cluster_size,
+  
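The arithmetic behind the new clamp can be checked in isolation. A quick sketch (Python standing in for the C expression; `allocation_representable` is an illustrative name, not a QEMU function):

```python
QCOW_MAX_CLUSTER_OFFSET = (1 << 56) - 1

def allocation_representable(free_cluster_index, cluster_bits,
                             max_offset=QCOW_MAX_CLUSTER_OFFSET):
    # Mirrors the check in alloc_clusters_noref(): the C code fails with
    # -EFBIG exactly where this returns False.
    return not (free_cluster_index > 0 and
                free_cluster_index - 1 > (max_offset >> cluster_bits))

# With 64k clusters (cluster_bits = 16), offsets stay representable up to
# cluster index 2^40; one more cluster would push past 2^56 - 1.
assert allocation_representable(1 << 40, 16)
assert not allocation_representable((1 << 40) + 1, 16)
```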

Re: [Qemu-block] [Qemu-devel] [RFC v4 04/21] blockjobs: add status enum

2018-02-27 Thread Eric Blake

On 02/23/2018 05:51 PM, John Snow wrote:

We're about to add several new states, and booleans are becoming
unwieldy and difficult to reason about. It would help to have a
more explicit bookkeeping of the state of blockjobs. To this end,
add a new "status" field and add our existing states in a redundant
manner alongside the bools they are replacing:




Some of these states appear somewhat superfluous, but it helps define the
expected flow of a job; so some of the states wind up being synchronous
empty transitions. Importantly, jobs can be in only one of these states
at any given time, which helps code and external users alike reason about
the current condition of a job unambiguously.

Signed-off-by: John Snow 
---



+++ b/qapi/block-core.json
@@ -955,6 +955,32 @@
  { 'enum': 'BlockJobType',
'data': ['commit', 'stream', 'mirror', 'backup'] }
  
+##

+# @BlockJobStatus:
+#
+# Indicates the present state of a given blockjob in its lifetime.
+#
+# @undefined: Erroneous, default state. Should not ever be visible.
+#
+# @created: The job has been created, but not yet started.
+#
+# @running: The job is currently running.
+#
+# @paused: The job is running, but paused. The pause may be requested by
+#  either the QMP user or by internal processes.
+#
+# @ready: The job is running, but is ready for the user to signal completion.
+# This is used for long-running jobs like mirror that are designed to
+# run indefinitely.
+#
+# @standby: The job is ready, but paused. This is nearly identical to @paused.
+#   The job may return to @ready or otherwise be canceled.
+#
+# Since: 2.12
+##
+{ 'enum': 'BlockJobStatus',
+  'data': ['undefined', 'created', 'running', 'paused', 'ready', 'standby'] }


Do we need 'undefined' in the list if it should never be visible?  I'm 
okay if you keep it because it makes other code easier, but as Kevin 
points out in 5/21, if you aren't even using it, it might be worth 
considering dropping the state altogether.


Otherwise, the patch looks fine to me.
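As an aside, the single-status bookkeeping this series introduces is easy to model outside QEMU. A minimal sketch (Python; the table mirrors the transitions proposed in patch 5/21, and the class name is illustrative):

```python
# Legal transitions, one row per current status.
ALLOWED = {
    'undefined': {'created'},
    'created':   {'running'},
    'running':   {'paused', 'ready'},
    'paused':    {'running'},
    'ready':     {'standby'},
    'standby':   {'ready'},
}

class BlockJob:
    def __init__(self):
        self.status = 'undefined'

    def transition(self, new):
        # Self-transitions are tolerated; anything else must be in the table.
        if new == self.status:
            return
        assert new in ALLOWED[self.status], \
            f'illegal transition {self.status} -> {new}'
        self.status = new

job = BlockJob()
for s in ('created', 'running', 'paused', 'running', 'ready', 'standby'):
    job.transition(s)
print(job.status)  # standby
```

An illegal hop such as created -> paused trips the assertion, which is exactly the kind of unambiguous reasoning the commit message is after.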

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-block] [RFC v4 05/21] blockjobs: add state transition table

2018-02-27 Thread John Snow


On 02/27/2018 11:27 AM, Kevin Wolf wrote:
> Am 24.02.2018 um 00:51 hat John Snow geschrieben:
>> The state transition table has mostly been implied. We're about to make
>> it a bit more complex, so let's make the STM explicit instead.
>>
>> Perform state transitions with a function that for now just asserts the
>> transition is appropriate.
>>
>> Transitions:
>> Undefined -> Created: During job initialization.
>> Created   -> Running: Once the job is started.
>>   Jobs cannot transition from "Created" to "Paused"
>>   directly, but will instead synchronously transition
>>   to running and then to paused immediately.
>> Running   -> Paused:  Normal workflow for pauses.
>> Running   -> Ready:   Normal workflow for jobs reaching their sync point.
>>   (e.g. mirror)
>> Ready -> Standby: Normal workflow for pausing ready jobs.
>> Paused-> Running: Normal resume.
>> Standby   -> Ready:   Resume of a Standby job.
>>
>>
>> +---------+
>> |UNDEFINED|
>> +----+----+
>>      |
>> +----v----+
>> | CREATED |
>> +----+----+
>>      |
>> +----v----+    +--------+
>> | RUNNING |<-->| PAUSED |
>> +----+----+    +--------+
>>      |
>> +----v----+    +---------+
>> |  READY  |<-->| STANDBY |
>> +---------+    +---------+
>>
>>
>> Notably, there is no state presently defined as of this commit that
>> deals with a job after the "running" or "ready" states, so this table
>> will be adjusted alongside the commits that introduce those states.
>>
>> Signed-off-by: John Snow 
>> ---
>>  block/trace-events |  3 +++
>>  blockjob.c | 42 --
>>  2 files changed, 39 insertions(+), 6 deletions(-)
>>
>> diff --git a/block/trace-events b/block/trace-events
>> index 02dd80ff0c..b75a0c8409 100644
>> --- a/block/trace-events
>> +++ b/block/trace-events
>> @@ -4,6 +4,9 @@
>>  bdrv_open_common(void *bs, const char *filename, int flags, const char *format_name) "bs %p filename \"%s\" flags 0x%x format_name \"%s\""
>>  bdrv_lock_medium(void *bs, bool locked) "bs %p locked %d"
>>  
>> +# blockjob.c
>> +block_job_state_transition(void *job, int ret, const char *legal, const char *s0, const char *s1) "job %p (ret: %d) attempting %s transition (%s-->%s)"
>> +
>>  # block/block-backend.c
>>  blk_co_preadv(void *blk, void *bs, int64_t offset, unsigned int bytes, int flags) "blk %p bs %p offset %"PRId64" bytes %u flags 0x%x"
>>  blk_co_pwritev(void *blk, void *bs, int64_t offset, unsigned int bytes, int flags) "blk %p bs %p offset %"PRId64" bytes %u flags 0x%x"
>> diff --git a/blockjob.c b/blockjob.c
>> index 1be9c20cff..d745b3bb69 100644
>> --- a/blockjob.c
>> +++ b/blockjob.c
>> @@ -28,6 +28,7 @@
>>  #include "block/block.h"
>>  #include "block/blockjob_int.h"
>>  #include "block/block_int.h"
>> +#include "block/trace.h"
>>  #include "sysemu/block-backend.h"
>>  #include "qapi/error.h"
>>  #include "qapi/qmp/qerror.h"
>> @@ -41,6 +42,34 @@
>>   * block_job_enter. */
>>  static QemuMutex block_job_mutex;
>>  
>> +/* BlockJob State Transition Table */
>> +bool BlockJobSTT[BLOCK_JOB_STATUS__MAX][BLOCK_JOB_STATUS__MAX] = {
>> +  /* U, C, R, P, Y, S */
>> +/* U: */ [BLOCK_JOB_STATUS_UNDEFINED] = {0, 1, 0, 0, 0, 0},
> 
> Even at the end of the series, this is the only use of
> BLOCK_JOB_STATUS_UNDEFINED.
> 
>> +/* C: */ [BLOCK_JOB_STATUS_CREATED]   = {0, 0, 1, 0, 0, 0},
>> +/* R: */ [BLOCK_JOB_STATUS_RUNNING]   = {0, 0, 0, 1, 1, 0},
>> +/* P: */ [BLOCK_JOB_STATUS_PAUSED]= {0, 0, 1, 0, 0, 0},
>> +/* Y: */ [BLOCK_JOB_STATUS_READY] = {0, 0, 0, 0, 0, 1},
>> +/* S: */ [BLOCK_JOB_STATUS_STANDBY]   = {0, 0, 0, 0, 1, 0},
>> +};
>> +
>> +static void block_job_state_transition(BlockJob *job, BlockJobStatus s1)
>> +{
>> +BlockJobStatus s0 = job->status;
>> +if (s0 == s1) {
>> +return;
>> +}
>> +assert(s1 >= 0 && s1 <= BLOCK_JOB_STATUS__MAX);
>> +trace_block_job_state_transition(job, job->ret, BlockJobSTT[s0][s1] ?
>> + "allowed" : "disallowed",
>> + qapi_enum_lookup(&BlockJobStatus_lookup, s0),
>> + qapi_enum_lookup(&BlockJobStatus_lookup, s1));
>> +assert(BlockJobSTT[s0][s1]);
>> +job->status = s1;
>> +}
>> +
>>  static void block_job_lock(void)
>>  {
>>  qemu_mutex_lock(&block_job_mutex);
>> @@ -320,7 +349,7 @@ void block_job_start(BlockJob *job)
>>  job->pause_count--;
>>  job->busy = true;
>>  job->paused = false;
>> -job->status = BLOCK_JOB_STATUS_RUNNING;
>> +block_job_state_transition(job, BLOCK_JOB_STATUS_RUNNING);
>>  bdrv_coroutine_enter(blk_bs(job->blk), job->co);
>>  }
>>  
>> @@ -704,6 +733,7 @@ void *block_job_create(const char *job_id, const BlockJobDriver *driver,
>>  job->refcnt

Re: [Qemu-block] [Qemu-devel] [RFC] qemu-img: Drop BLK_ZERO from convert

2018-02-27 Thread Stefan Hajnoczi
On Mon, Feb 26, 2018 at 06:03:13PM +0100, Max Reitz wrote:
> There are filesystems (among which is tmpfs) that have a hard time
> reporting allocation status.  That is definitely a bug in them.
> 
> However, there is no good reason why qemu-img convert should query the
> allocation status in the first place.  It does zero detection by itself
> anyway, so we can detect unallocated areas ourselves.
> 
> Furthermore, if a filesystem driver has any sense, reading unallocated
> data should take just as much time as lseek(SEEK_DATA) + memset().  So
> the only overhead we introduce by dropping the manual lseek() call is a
> memset() in the driver and a buffer_is_zero() in qemu-img, both of which
> should be relatively quick.

This makes sense.  Which file systems did you test this patch on?

XFS, ext4, and tmpfs would be a good minimal test set to prove the
patch.  Perhaps with two input files:
1. A file that is mostly filled with data.
2. A file that is only sparsely populated with data.

The time taken should be comparable with the time before this patch.
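The argument that allocation-status queries are dispensable is that zero detection alone already yields a sparse output. A rough model of that idea (plain Python, not qemu-img's actual implementation; file names are illustrative):

```python
import os
import tempfile

CHUNK = 64 * 1024

def convert(src_path, dst_path):
    """Copy src to dst, seeking over all-zero chunks instead of writing them."""
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        while True:
            buf = src.read(CHUNK)
            if not buf:
                break
            if buf.count(0) == len(buf):          # buffer_is_zero() stand-in
                dst.seek(len(buf), os.SEEK_CUR)   # leave a hole in the output
            else:
                dst.write(buf)
        dst.truncate(src.tell())  # give the output its full logical size

# Round-trip demo: a mostly-sparse input converts to identical contents.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'src.img')
dst = os.path.join(workdir, 'dst.img')
with open(src, 'wb') as f:
    f.write(b'\0' * CHUNK + b'data' + b'\0' * (2 * CHUNK - 4))
convert(src, dst)
with open(src, 'rb') as a, open(dst, 'rb') as b:
    assert a.read() == b.read()
```

No lseek(SEEK_DATA) is issued anywhere, yet the all-zero chunks never hit the output as written data — the filesystem's allocation reporting simply does not matter.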




[Qemu-block] [RFC PATCH 1/2] migrate: Allow incoming migration without defer

2018-02-27 Thread Richard Palethorpe
Allow a QEMU instance which has been started and used without the "-incoming"
flag to accept an incoming migration with the "migrate-incoming" QMP
command. This allows the user to dump the VM state to an external file then
revert to that state at a later time without restarting QEMU.
---
 migration/migration.c | 12 +---
 vl.c  |  2 ++
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 0aa596f867..05595a4cec 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1321,14 +1321,14 @@ void migrate_del_blocker(Error *reason)
 void qmp_migrate_incoming(const char *uri, Error **errp)
 {
 Error *local_err = NULL;
-static bool once = true;
 
-if (!deferred_incoming) {
-error_setg(errp, "For use with '-incoming defer'");
+if (runstate_check(RUN_STATE_INMIGRATE)) {
+error_setg(errp, "The incoming migration has already been started");
 return;
 }
-if (!once) {
-error_setg(errp, "The incoming migration has already been started");
+
+if (!deferred_incoming) {
+vm_stop(RUN_STATE_INMIGRATE);
 }
 
 qemu_start_incoming_migration(uri, &local_err);
@@ -1337,8 +1337,6 @@ void qmp_migrate_incoming(const char *uri, Error **errp)
 error_propagate(errp, local_err);
 return;
 }
-
-once = false;
 }
 
 bool migration_is_blocked(Error **errp)
diff --git a/vl.c b/vl.c
index 9e7235df6d..a91eda039e 100644
--- a/vl.c
+++ b/vl.c
@@ -634,6 +634,7 @@ static const RunStateTransition runstate_transitions_def[] = {
 { RUN_STATE_PAUSED, RUN_STATE_POSTMIGRATE },
 { RUN_STATE_PAUSED, RUN_STATE_PRELAUNCH },
 { RUN_STATE_PAUSED, RUN_STATE_COLO},
+{ RUN_STATE_PAUSED, RUN_STATE_INMIGRATE },
 
 { RUN_STATE_POSTMIGRATE, RUN_STATE_RUNNING },
 { RUN_STATE_POSTMIGRATE, RUN_STATE_FINISH_MIGRATE },
@@ -665,6 +666,7 @@ static const RunStateTransition runstate_transitions_def[] = {
 { RUN_STATE_RUNNING, RUN_STATE_WATCHDOG },
 { RUN_STATE_RUNNING, RUN_STATE_GUEST_PANICKED },
 { RUN_STATE_RUNNING, RUN_STATE_COLO},
+{ RUN_STATE_RUNNING, RUN_STATE_INMIGRATE },
 
 { RUN_STATE_SAVE_VM, RUN_STATE_RUNNING },
 
-- 
2.16.2




[Qemu-block] [RFC PATCH 2/2] migrate: Tests for migrating to an already used QEMU

2018-02-27 Thread Richard Palethorpe
Currently this appears to work for X86_64, but PPC64 fails due to some
unexpected data on the serial port after migration. This probably requires
quite a bit more work.
---
 tests/migration-test.c | 113 +
 1 file changed, 85 insertions(+), 28 deletions(-)

diff --git a/tests/migration-test.c b/tests/migration-test.c
index 74f9361bdd..8217d2e83e 100644
--- a/tests/migration-test.c
+++ b/tests/migration-test.c
@@ -125,19 +125,17 @@ static void init_bootfile_ppc(const char *bootpath)
  * we get an 'A' followed by an endless string of 'B's
  * but on the destination we won't have the A.
  */
-static void wait_for_serial(const char *side)
+static void wait_for_serial(const char *side, gboolean started)
 {
 char *serialpath = g_strdup_printf("%s/%s", tmpfs, side);
 FILE *serialfile = fopen(serialpath, "r");
 const char *arch = qtest_get_arch();
-int started = (strcmp(side, "src_serial") == 0 &&
-   strcmp(arch, "ppc64") == 0) ? 0 : 1;
 
 g_free(serialpath);
 do {
 int readvalue = fgetc(serialfile);
 
-if (!started) {
+if (!started && !strcmp(arch, "ppc64")) {
 /* SLOF prints its banner before starting test,
  * to ignore it, mark the start of the test with '_',
  * ignore all characters until this marker
@@ -358,16 +356,21 @@ static void migrate_set_capability(QTestState *who, const char *capability,
 QDECREF(rsp);
 }
 
-static void migrate(QTestState *who, const char *uri)
+static void migrate(QTestState *who, const char *uri, gboolean incoming)
 {
 QDict *rsp;
+const gchar *cmd_name = incoming ? "migrate-incoming" : "migrate";
 gchar *cmd;
 
-cmd = g_strdup_printf("{ 'execute': 'migrate',"
+cmd = g_strdup_printf("{ 'execute': '%s',"
   "'arguments': { 'uri': '%s' } }",
-  uri);
+  cmd_name, uri);
 rsp = qtest_qmp(who, cmd);
 g_free(cmd);
+while (qdict_haskey(rsp, "event")) {
+QDECREF(rsp);
+rsp = qtest_qmp_receive(who);
+}
 g_assert(qdict_haskey(rsp, "return"));
 QDECREF(rsp);
 }
@@ -382,15 +385,22 @@ static void migrate_start_postcopy(QTestState *who)
 }
 
 static void test_migrate_start(QTestState **from, QTestState **to,
-   const char *uri)
+   const char *uri, gboolean incoming)
 {
-gchar *cmd_src, *cmd_dst;
+gchar *cmd_src;
+GString *cmd_dst = g_string_new(NULL);
 char *bootpath = g_strdup_printf("%s/bootsect", tmpfs);
+char *bootpath_dst;
 const char *arch = qtest_get_arch();
 const char *accel = "kvm:tcg";
 
 got_stop = false;
 
+if (incoming)
+bootpath_dst = bootpath;
+else
+bootpath_dst = g_strdup_printf("%s/bootsect_dst", tmpfs);
+
 if (strcmp(arch, "i386") == 0 || strcmp(arch, "x86_64") == 0) {
 init_bootfile_x86(bootpath);
 cmd_src = g_strdup_printf("-machine accel=%s -m 150M"
@@ -398,12 +408,14 @@ static void test_migrate_start(QTestState **from, QTestState **to,
   " -serial file:%s/src_serial"
   " -drive file=%s,format=raw",
   accel, tmpfs, bootpath);
-cmd_dst = g_strdup_printf("-machine accel=%s -m 150M"
-  " -name target,debug-threads=on"
-  " -serial file:%s/dest_serial"
-  " -drive file=%s,format=raw"
-  " -incoming %s",
-  accel, tmpfs, bootpath, uri);
+if (!incoming)
+init_bootfile_x86(bootpath_dst);
+
+g_string_printf(cmd_dst, "-machine accel=%s -m 150M"
+" -name target,debug-threads=on"
+" -serial file:%s/dest_serial"
+" -drive file=%s,format=raw",
+accel, tmpfs, bootpath_dst);
 } else if (strcmp(arch, "ppc64") == 0) {
 
 /* On ppc64, the test only works with kvm-hv, but not with kvm-pr */
@@ -416,22 +428,29 @@ static void test_migrate_start(QTestState **from, QTestState **to,
   " -serial file:%s/src_serial"
   " -drive file=%s,if=pflash,format=raw",
   accel, tmpfs, bootpath);
-cmd_dst = g_strdup_printf("-machine accel=%s -m 256M"
-  " -name target,debug-threads=on"
-  " -serial file:%s/dest_serial"
-  " -incoming %s",
-  accel, tmpfs, uri);
+g_string_printf(cmd_dst, "-machine accel=%s -m 256M"
+" -name target,debug-threads=on"
+" -serial file:%s/dest_serial",
+accel, 

[Qemu-block] [RFC PATCH 0/2] Increase usability of external snapshots

2018-02-27 Thread Richard Palethorpe
Hello,

Following on from the discussion about creating savevm/loadvm QMP
equivalents. I decided to take the advice given that we should use external
snapshots. However reverting to a snapshot currently requires QEMU to be
restarted with "-incoming". Restarting QEMU requires a fair bit of
bookkeeping to be done by the management application and may take measurably more
time.

Also to quote the previous discussion:
>>> > Then you could just use the regular migrate QMP commands for loading
>>> > and saving snapshots.  Might need a little extra work on the incoming
>>> > side, since we need to be able to load snapshots, despite QEMU not
>>> > being started with '-incoming defer', but might still be doable ?
>>> > This would theoretically give us progress monitoring, cancellation,
>>> > etc for free.
>>> 
>>> What actually stops this working other than the sanity check in
>>> migrate_incoming ?
>>
>> No idea really - not looked closely at the code implications.
>
> It would be a plus for migration code, right now there are _two_
> implementations, and savevm/loadvm one gets less love.
>
> And we will check "much more" the way to load migration in a
> non-pristine qemu, so 
>
> Later, Juan.

This patch just changes a few lines which shuts down the currently running VM
and puts the QEMU instance into a state where it can accept an incoming
migration. I haven't tested it yet with more complex scenarios, but it
appears to work fine so far.
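With this change, a save/revert cycle could in principle be driven entirely over QMP, along these lines (hypothetical session; the `exec:` URIs and the state file name are illustrative, and reverting the disks, e.g. via external snapshot overlays, is a separate step outside this patch):

```json
{ "execute": "stop" }
{ "execute": "migrate",
  "arguments": { "uri": "exec:cat > /tmp/vmstate.bin" } }

{ "execute": "migrate-incoming",
  "arguments": { "uri": "exec:cat /tmp/vmstate.bin" } }
{ "execute": "cont" }
```

The first two commands dump the VM state; the last two, issued later on the same still-running instance, load it back — exactly the RUNNING/PAUSED -> INMIGRATE transitions that patch 1/2 adds to the runstate table.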

I have also started work on a new migration test, which is almost identical to
the existing post-copy test, but migrates to a non-pristine QEMU. It works on
X86, but not yet on PPC64 because the serial output is different. It needs
quite a lot more work, but before I continue, I would like to know first
whether this is something people fundamentally object to.

Richard Palethorpe (2):
  migrate: Allow incoming migration without defer
  migrate: Tests for migrating to an already used QEMU

 migration/migration.c  |  12 +++---
 tests/migration-test.c | 113 +
 vl.c   |   2 +
 3 files changed, 92 insertions(+), 35 deletions(-)

-- 
2.16.2




[Qemu-block] [PATCH v4 0/5] minor compression improvements

2018-02-27 Thread Eric Blake
In v4:
- patch 2 has even more wording tweaks [Kevin]
- patch 3 is new [Berto]
- patch 5 fixes an off-by-one [Berto]

Eric Blake (5):
  qcow2: Prefer byte-based calls into bs->file
  qcow2: Document some maximum size constraints
  qcow2: Reduce REFT_OFFSET_MASK
  qcow2: Don't allow overflow during cluster allocation
  qcow2: Avoid memory over-allocation on compressed images

 docs/interop/qcow2.txt | 38 +++---
 block/qcow2.h  |  8 +++-
 block/qcow2-cluster.c  | 32 
 block/qcow2-refcount.c | 27 +--
 block/qcow2.c  |  2 +-
 5 files changed, 80 insertions(+), 27 deletions(-)

-- 
2.14.3




Re: [Qemu-block] [PATCH] vl: introduce vm_shutdown()

2018-02-27 Thread Stefan Hajnoczi
On Fri, Feb 23, 2018 at 04:20:44PM +0800, Fam Zheng wrote:
> On Tue, 02/20 13:10, Stefan Hajnoczi wrote:
> > 1. virtio_scsi_handle_cmd_vq() racing with iothread_stop_all() hits the
> >virtio_scsi_ctx_check() assertion failure because the BDS AioContext
> >has been modified by iothread_stop_all().
> 
> Does this patch fix the issue completely? IIUC virtio_scsi_handle_cmd can
> already be entered at the time of main thread calling virtio_scsi_clear_aio(),
> so this race condition still exists:
> 
>   main thread   iothread
> -
>   vm_shutdown
> ...
>   virtio_bus_stop_ioeventfd
> virtio_scsi_dataplane_stop
> aio_poll()
>   ...
> 
> virtio_scsi_data_plane_handle_cmd()
>   aio_context_acquire(s->ctx)
>   virtio_scsi_acquire(s).enter
>   virtio_scsi_clear_aio()
>   aio_context_release(s->ctx)
>   virtio_scsi_acquire(s).return
>   virtio_scsi_handle_cmd_vq()
> ...
>   virtqueue_pop()
> 
> Is it possible that the above virtqueue_pop() still returns one element that 
> was
> queued before vm_shutdown() was called?

No, it can't because virtio_scsi_clear_aio() invokes
virtio_queue_host_notifier_aio_read(&vq->host_notifier) to process the
virtqueue.  By the time we get back to iothread's
virtio_scsi_data_plane_handle_cmd() the virtqueue is already empty.

Vcpus have been paused so no additional elements can slip into the
virtqueue.
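The ordering being described — stop the producer, drain what is already queued while detaching, and only then let the racing handler run — can be modelled with a toy queue (Python; names are illustrative, this is not the virtio code):

```python
from collections import deque

class ToyVirtQueue:
    def __init__(self):
        self.queue = deque()
        self.producer_paused = False

    def producer_push(self, req):
        # Stands in for the vcpu queueing a request; vcpus are paused first.
        assert not self.producer_paused
        self.queue.append(req)

    def clear_aio(self):
        # Like virtio_scsi_clear_aio(): process everything still queued
        # while detaching, so a racing handler later finds the queue empty.
        drained = []
        while self.queue:
            drained.append(self.queue.popleft())
        return drained

vq = ToyVirtQueue()
vq.producer_push('read')
vq.producer_push('write')
vq.producer_paused = True     # vm_shutdown() pauses the vcpus...
done = vq.clear_aio()         # ...then the main thread drains on stop
assert not vq.queue           # the iothread's handler sees nothing left
```

Because pausing happens strictly before draining, virtqueue_pop() in the late-running handler cannot observe a stale element.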




Re: [Qemu-block] [PATCH v3 2/4] qcow2: Document some maximum size constraints

2018-02-27 Thread Alberto Garcia
> Note that L1 and L2 fields both stop at bit 55 currently, but do have
> room for expansion up to bit 61; so all three limits (if we include
> refcount table in the set capped at bit 55 for now) could be raised
> simultaneously if we find 64P too small in the future.

64 petabytes ought to be enough for anybody :-)

Berto



Re: [Qemu-block] [PATCH v3 3/7] qapi: Replace qobject_to_X(o) by qobject_to(o, X)

2018-02-27 Thread Eric Blake

On 02/26/2018 06:01 AM, Max Reitz wrote:


+++ b/block.c
@@ -1457,7 +1457,7 @@ static QDict *parse_json_filename(const char *filename, Error **errp)
   return NULL;
   }
   -    options = qobject_to_qdict(options_obj);
+    options = qobject_to(options_obj, QDict);


Bikeshedding - would it read any easier as:

options = qobject_to(QDict, options_obj);

?  If so, your Coccinelle script can be touched up, and patch 2/7 swaps
argument order around, so it would be tolerable but still slightly
busywork to regenerate the series.  But I'm not strongly attached to
either order, and so I'm also willing to take this as-is (especially
since that's less work), if no one else has a strong opinion that
swapping order would aid legibility.


Well, same for me. :-)

In a template/generic language, we'd write the type first (e.g.
qobject_cast<QDict>(options_obj)).  But maybe we'd write the object
first, too (e.g. options_obj.cast<QDict>()).  And the current order of
the arguments follows the order in the name ("qobject" options_obj "to"
QDict).  But maybe it's more natural to read it as "qobject to" QDict
"applied to" options_obj.

I don't know either.


Okay, after looking for existing uses of type names in macro calls, I see:

qemu/compiler.h:

#ifndef container_of
#define container_of(ptr, type, member) ({  \
const typeof(((type *) 0)->member) *__mptr = (ptr); \
(type *) ((char *) __mptr - offsetof(type, member));})
#endif

/* Convert from a base type to a parent type, with compile time checking.  */

#ifdef __GNUC__
#define DO_UPCAST(type, field, dev) ( __extension__ ( { \
char __attribute__((unused)) offset_must_be_zero[ \
-offsetof(type, field)]; \
container_of(dev, type, field);}))
#else
#define DO_UPCAST(type, field, dev) container_of(dev, type, field)
#endif

#define typeof_field(type, field) typeof(((type *)0)->field)


qapi/clone-visitor.h:

/*
 * Deep-clone QAPI object @src of the given @type, and return the result.
 *
 * Not usable on QAPI scalars (integers, strings, enums), nor on a
 * QAPI object that references the 'any' type.  Safe when @src is NULL.
 */
#define QAPI_CLONE(type, src)   \

/*
 * Copy deep clones of @type members from @src to @dst.
 *
 * Not usable on QAPI scalars (integers, strings, enums), nor on a
 * QAPI object that references the 'any' type.
 */
#define QAPI_CLONE_MEMBERS(type, dst, src)  \


2 out of 3 macros in compiler.h put the type name first, and 
container_of() puts it in the middle of 3.  It's even weirder because 
DO_UPCAST(t, f, d) calls container_of(d, t, f), where the inconsistency 
makes it a mental struggle to figure out how to read the two macros side 
by side, compared to if we had just been consistent.  Meanwhile, all of 
the macros in qapi put the type name first.


So at this point, I'm 70:30 in favor of doing the rename to have 
qobject_to(type, obj) for consistency with majority of other macros that 
take a type name (type names are already unusual as arguments to macros, 
whether or not the macro is named with ALL_CAPS).  (Sorry, I know that 
means more busy work for you, if you agree with my reasoning)
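Whichever argument order wins, the semantics stay a checked downcast, roughly (Python analogue of the macro, for intuition only):

```python
def qobject_to(obj, typ):
    """Return obj viewed as typ, or None if obj is None or another type."""
    return obj if isinstance(obj, typ) else None

# Toy stand-ins for the QObject subtypes.
class QDict(dict): pass
class QList(list): pass

options = QDict(filename='disk.qcow2')
assert qobject_to(options, QDict) is options  # right type: pass-through
assert qobject_to(options, QList) is None     # wrong type: NULL
assert qobject_to(None, QDict) is None        # NULL in, NULL out
```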


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-block] [PATCH v3 2/4] qcow2: Document some maximum size constraints

2018-02-27 Thread Eric Blake

On 02/27/2018 05:47 AM, Kevin Wolf wrote:

Am 22.02.2018 um 16:59 hat Eric Blake geschrieben:

Although off_t permits up to 63 bits (8EB) of file offsets, in
practice, we're going to hit other limits first.  Document some
of those limits in the qcow2 spec, and how choice of cluster size
can influence some of the limits.

While at it, notice that since we cannot map any virtual cluster
to any address higher than 64 PB (56 bits) (due to the L1/L2 field
encoding), it makes little sense to require the refcount table to
access host offsets beyond that point.  Mark the upper bits of
the refcount table entries as reserved, with no ill effects, since
it is unlikely that there are any existing images larger than 64PB
in the first place, and thus all existing images already have those
bits as 0.

Signed-off-by: Eric Blake 


I think it would be good to mention the exact reason for the 56 bits in
the spec. Even this commit message is rather vague ('L1/L2 field
encoding'), so if at some point someone wonders, if we couldn't simply
extend the allowed range, they won't easily see that it's related to
compressed clusters.


Note that L1 and L2 fields both stop at bit 55 currently, but do have 
room for expansion up to bit 61; so all three limits (if we include 
refcount table in the set capped at bit 55 for now) could be raised 
simultaneously if we find 64P too small in the future.


Compressed clusters are also related, but there, the limit is even 
smaller - with 2M clusters, a compressed cluster must reside within the 
first 512T of host offsets, and there are no free bits available for 
allowing additional compressed cluster storage without making an 
incompatible change of how compressed clusters are represented.
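Those limits fall straight out of the compressed-cluster descriptor layout: of the 62 usable bits, `cluster_bits - 8` encode the compressed size, and the host offset gets the rest, capped by the 56-bit standard cluster offset discussed in this thread. A quick check of the figures quoted above (Python; the helper name is illustrative):

```python
def max_compressed_host_offset(cluster_bits):
    # Compressed cluster descriptor: of the 62 low bits, cluster_bits - 8
    # hold the compressed sector count; the host offset gets the rest...
    offset_bits = 62 - (cluster_bits - 8)
    # ...but never more than the 56 bits a standard cluster offset can use.
    offset_bits = min(offset_bits, 56)
    return (1 << offset_bits) - 1

PB = 1 << 50
TB = 1 << 40
assert max_compressed_host_offset(14) == 64 * PB - 1   # 16k clusters
assert max_compressed_host_offset(16) == 16 * PB - 1   # 64k clusters
assert max_compressed_host_offset(21) == 512 * TB - 1  # 2M clusters
```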


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-block] [Qemu-devel] [PATCH v2 14/16] block/mirror: Add active mirroring

2018-02-27 Thread Eric Blake

On 02/27/2018 03:34 AM, Fam Zheng wrote:

On Mon, 01/22 23:08, Max Reitz wrote:

@@ -1151,7 +1285,48 @@ static int coroutine_fn 
bdrv_mirror_top_preadv(BlockDriverState *bs,
  static int coroutine_fn bdrv_mirror_top_pwritev(BlockDriverState *bs,
  uint64_t offset, uint64_t bytes, QEMUIOVector *qiov, int flags)
  {
-return bdrv_co_pwritev(bs->backing, offset, bytes, qiov, flags);
+MirrorOp *op = NULL;
+MirrorBDSOpaque *s = bs->opaque;
+QEMUIOVector bounce_qiov;
+void *bounce_buf;
+int ret = 0;
+bool copy_to_target;
+
+copy_to_target = s->job->ret >= 0 &&
+ s->job->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING;
+
+if (copy_to_target) {
+/* The guest might concurrently modify the data to write; but
+ * the data on source and destination must match, so we have
+ * to use a bounce buffer if we are going to write to the
+ * target now. */
+bounce_buf = qemu_blockalign(bs, bytes);
+iov_to_buf_full(qiov->iov, qiov->niov, 0, bounce_buf, bytes);


Quorum doesn't use a bounce buffer, so I think we can get away without it too: a
guest concurrently modifying the buffer isn't a concern in practice.


Arguably, that's a bug in quorum.  We also use a bounce buffer for the 
same reason when encrypting.  We really do need to make sure that bits 
landing in more than one storage location come from the same point in time.
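The point-in-time requirement is easy to demonstrate in miniature: if the guest can touch the buffer between the two writes, the copies diverge unless a private copy is taken first (Python sketch; `bytes()` plays the role of the bounce buffer):

```python
def write_to_both(buf, take_bounce_copy):
    # Either write straight from the guest buffer, or from a private copy.
    data = bytes(buf) if take_bounce_copy else buf
    source = bytes(data)   # "write" to the mirror source
    buf[0] ^= 0xFF         # guest modifies its buffer in between
    target = bytes(data)   # "write" to the mirror target
    return source, target

guest_buf = bytearray(b'guest data')
src, tgt = write_to_both(guest_buf, take_bounce_copy=False)
assert src != tgt          # without a bounce buffer the copies diverge

guest_buf = bytearray(b'guest data')
src, tgt = write_to_both(guest_buf, take_bounce_copy=True)
assert src == tgt          # with one, both sides see the same snapshot
```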


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-block] [PATCH v3 2/7] qapi: Add qobject_to()

2018-02-27 Thread Alberto Garcia
On Sat 24 Feb 2018 04:40:28 PM CET, Max Reitz wrote:
> This is a dynamic casting macro that, given a QObject type, returns an
> object as that type or NULL if the object is of a different type (or
> NULL itself).
>
> The macro uses lower-case letters because:
> 1. There does not seem to be a hard rule on whether qemu macros have to
>be upper-cased,
> 2. The current situation in qapi/qmp is inconsistent (compare e.g.
>QINCREF() vs. qdict_put()),
> 3. qobject_to() will evaluate its @obj parameter only once, thus it is
>generally not important to the caller whether it is a macro or not,
> 4. I prefer it aesthetically.
>
> Signed-off-by: Max Reitz 

Reviewed-by: Alberto Garcia 

Berto



Re: [Qemu-block] [PATCH v3 1/7] compiler: Add QEMU_BUILD_BUG_MSG() macro

2018-02-27 Thread Alberto Garcia
On Sat 24 Feb 2018 04:40:27 PM CET, Max Reitz wrote:
> _Static_assert() allows us to specify messages, and that may come in
> handy.  Even without _Static_assert(), encouraging developers to put a
> helpful message next to the QEMU_BUILD_BUG_* may make debugging easier
> whenever it breaks.
>
> Signed-off-by: Max Reitz 

Reviewed-by: Alberto Garcia 

Berto



Re: [Qemu-block] [RFC PATCH 1/2] qcow2: Allow checking and repairing corrupted internal snapshots

2018-02-27 Thread Alberto Garcia
On Mon 26 Feb 2018 02:40:08 PM CET, Max Reitz wrote:
>> +"L1 table is too large; snapshot table entry corrupted\n",
>> +(fix & BDRV_FIX_ERRORS) ? "Deleting" : "ERROR",
>> +sn->id_str, sn->name, sn->l1_size);
>> +found_corruption = true;
>> +}
>
> This code assumes the snapshot table itself has been valid.  Why should
> it be when it contains garbage entries?
>
>> +
>> +if (found_corruption) {
>> +result->corruptions++;
>> +sn->l1_size = 0; /* Prevent this L1 table from being used */
>> +if (fix & BDRV_FIX_ERRORS) {
>> +ret = qcow2_snapshot_delete(bs, sn->id_str, sn->name, NULL);
>
> So calling this is actually very dangerous.  It modifies the snapshot
> table which I wouldn't trust is actually just a snapshot table.  It
> could intersect any other structure in the qcow2 image.  Yes, we do an
> overlap check, but that only protects metadata, and I don't really
> want to see an overlap check corruption when repairing the image;
> especially since this means you cannot fix the corruption.

I think you're right. One thing that could be done is check that the
area where the snapshot table is supposed to be doesn't overlap with any
other data structures.

> I don't quite know myself what to do instead, but I guess my main
> point would be: Before any (potentially) destructive changes are made,
> the user should have the chance of still opening the image read-only
> and copying all the data off somewhere else.  Which of course again
> means we shouldn't prevent the user from opening an image because a
> snapshot is broken.

Ok, you've convinced me. I'll see what I can come up with. I guess
initially we can simply add checks for sn->l1_table_offset and
sn->l1_size in the places where they're used.
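The sort of up-front validation being proposed might look like this (Python sketch; the bounds are illustrative, not the actual qcow2 limits):

```python
def check_snapshot_l1(l1_table_offset, l1_size, cluster_size,
                      file_size, max_l1_entries=0x2000000):
    """Return a list of reasons this snapshot's L1 table cannot be trusted."""
    problems = []
    if l1_table_offset % cluster_size:
        problems.append('L1 table offset not cluster-aligned')
    if l1_size > max_l1_entries:
        problems.append('L1 table is too large')
    if l1_table_offset + l1_size * 8 > file_size:  # 8 bytes per L1 entry
        problems.append('L1 table lies beyond the end of the image')
    return problems

# A sane snapshot passes; a misaligned offset is flagged, not repaired.
assert check_snapshot_l1(0x10000, 16, 65536, 1 << 30) == []
assert 'L1 table offset not cluster-aligned' in \
    check_snapshot_l1(0x10001, 16, 65536, 1 << 30)
```

Collecting reasons instead of deleting anything keeps the image openable read-only, so the data can still be copied off before any repair is attempted.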

Berto



Re: [Qemu-block] [PATCH v3 2/4] qcow2: Document some maximum size constraints

2018-02-27 Thread Kevin Wolf
Am 22.02.2018 um 16:59 hat Eric Blake geschrieben:
> Although off_t permits up to 63 bits (8EB) of file offsets, in
> practice, we're going to hit other limits first.  Document some
> of those limits in the qcow2 spec, and how choice of cluster size
> can influence some of the limits.
> 
> While at it, notice that since we cannot map any virtual cluster
> to any address higher than 64 PB (56 bits) (due to the L1/L2 field
> encoding), it makes little sense to require the refcount table to
> access host offsets beyond that point.  Mark the upper bits of
> the refcount table entries as reserved, with no ill effects, since
> it is unlikely that there are any existing images larger than 64PB
> in the first place, and thus all existing images already have those
> bits as 0.
> 
> Signed-off-by: Eric Blake 

I think it would be good to mention the exact reason for the 56 bits in
the spec. Even this commit message is rather vague ('L1/L2 field
encoding'), so if at some point someone wonders, if we couldn't simply
extend the allowed range, they won't easily see that it's related to
compressed clusters.

Kevin



Re: [Qemu-block] [PATCH v2 00/16] block/mirror: Add active-sync mirroring

2018-02-27 Thread Fam Zheng
On Sat, 02/24 16:42, Max Reitz wrote:
> Pïng

Looks good in general. I've left some small questions though.

Fam



Re: [Qemu-block] [PATCH v2 15/16] block/mirror: Add copy mode QAPI interface

2018-02-27 Thread Fam Zheng
On Mon, 01/22 23:08, Max Reitz wrote:
> This patch allows the user to specify whether to use active or only
> passive mode for mirror block jobs.  Currently, this setting will remain

I think you want s/passive/background/ in the whole patch.

> constant for the duration of the entire block job.
> 
> Signed-off-by: Max Reitz 
> ---
>  qapi/block-core.json  | 11 +--
>  include/block/block_int.h |  4 +++-
>  block/mirror.c| 12 +++-
>  blockdev.c|  9 -
>  4 files changed, 27 insertions(+), 9 deletions(-)
> 
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index ba1fd736f5..5fafa5fcac 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -1573,6 +1573,9 @@
>  # written. Both will result in identical contents.
>  # Default is true. (Since 2.4)
>  #
> +# @copy-mode: when to copy data to the destination; defaults to 'passive'
> +# (Since: 2.12)
> +#
>  # Since: 1.3
>  ##
>  { 'struct': 'DriveMirror',
> @@ -1582,7 +1585,7 @@
>  '*speed': 'int', '*granularity': 'uint32',
>  '*buf-size': 'int', '*on-source-error': 'BlockdevOnError',
>  '*on-target-error': 'BlockdevOnError',
> -'*unmap': 'bool' } }
> +'*unmap': 'bool', '*copy-mode': 'MirrorCopyMode' } }
>  
>  ##
>  # @BlockDirtyBitmap:
> @@ -1761,6 +1764,9 @@
>  #above @device. If this option is not given, a node name 
> is
>  #autogenerated. (Since: 2.9)
>  #
> +# @copy-mode: when to copy data to the destination; defaults to 'passive'
> +# (Since: 2.12)
> +#
>  # Returns: nothing on success.
>  #
>  # Since: 2.6
> @@ -1781,7 +1787,8 @@
>  '*speed': 'int', '*granularity': 'uint32',
>  '*buf-size': 'int', '*on-source-error': 'BlockdevOnError',
>  '*on-target-error': 'BlockdevOnError',
> -'*filter-node-name': 'str' } }
> +'*filter-node-name': 'str',
> +'*copy-mode': 'MirrorCopyMode' } }
>  
>  ##
>  # @block_set_io_throttle:
> diff --git a/include/block/block_int.h b/include/block/block_int.h
> index 03f3fdd129..1fda4d3d43 100644
> --- a/include/block/block_int.h
> +++ b/include/block/block_int.h
> @@ -948,6 +948,7 @@ void commit_active_start(const char *job_id, 
> BlockDriverState *bs,
>   * @filter_node_name: The node name that should be assigned to the filter
>   * driver that the mirror job inserts into the graph above @bs. NULL means 
> that
>   * a node name should be autogenerated.
> + * @copy_mode: When to trigger writes to the target.
>   * @errp: Error object.
>   *
>   * Start a mirroring operation on @bs.  Clusters that are allocated
> @@ -961,7 +962,8 @@ void mirror_start(const char *job_id, BlockDriverState 
> *bs,
>MirrorSyncMode mode, BlockMirrorBackingMode backing_mode,
>BlockdevOnError on_source_error,
>BlockdevOnError on_target_error,
> -  bool unmap, const char *filter_node_name, Error **errp);
> +  bool unmap, const char *filter_node_name,
> +  MirrorCopyMode copy_mode, Error **errp);
>  
>  /*
>   * backup_job_create:
> diff --git a/block/mirror.c b/block/mirror.c
> index 83082adb64..3b23886a5a 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -1409,7 +1409,7 @@ static void mirror_start_job(const char *job_id, 
> BlockDriverState *bs,
>   const BlockJobDriver *driver,
>   bool is_none_mode, BlockDriverState *base,
>   bool auto_complete, const char 
> *filter_node_name,
> - bool is_mirror,
> + bool is_mirror, MirrorCopyMode copy_mode,
>   Error **errp)
>  {
>  MirrorBlockJob *s;
> @@ -1515,7 +1515,7 @@ static void mirror_start_job(const char *job_id, 
> BlockDriverState *bs,
>  s->on_target_error = on_target_error;
>  s->is_none_mode = is_none_mode;
>  s->backing_mode = backing_mode;
> -s->copy_mode = MIRROR_COPY_MODE_BACKGROUND;
> +s->copy_mode = copy_mode;
>  s->base = base;
>  s->granularity = granularity;
>  s->buf_size = ROUND_UP(buf_size, granularity);
> @@ -1582,7 +1582,8 @@ void mirror_start(const char *job_id, BlockDriverState 
> *bs,
>MirrorSyncMode mode, BlockMirrorBackingMode backing_mode,
>BlockdevOnError on_source_error,
>BlockdevOnError on_target_error,
> -  bool unmap, const char *filter_node_name, Error **errp)
> +  bool unmap, const char *filter_node_name,
> +  MirrorCopyMode copy_mode, Error **errp)
>  {
>  bool is_none_mode;
>  BlockDriverState *base;
> @@ -1597,7 +1598,7 @@ void mirror_start(const char *job_id, BlockDriverState 
> *bs,
>   speed, granularity, buf_size, 

Re: [Qemu-block] [PATCH v2 14/16] block/mirror: Add active mirroring

2018-02-27 Thread Fam Zheng
On Mon, 01/22 23:08, Max Reitz wrote:
> @@ -1151,7 +1285,48 @@ static int coroutine_fn 
> bdrv_mirror_top_preadv(BlockDriverState *bs,
>  static int coroutine_fn bdrv_mirror_top_pwritev(BlockDriverState *bs,
>  uint64_t offset, uint64_t bytes, QEMUIOVector *qiov, int flags)
>  {
> -return bdrv_co_pwritev(bs->backing, offset, bytes, qiov, flags);
> +MirrorOp *op = NULL;
> +MirrorBDSOpaque *s = bs->opaque;
> +QEMUIOVector bounce_qiov;
> +void *bounce_buf;
> +int ret = 0;
> +bool copy_to_target;
> +
> +copy_to_target = s->job->ret >= 0 &&
> + s->job->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING;
> +
> +if (copy_to_target) {
> +/* The guest might concurrently modify the data to write; but
> + * the data on source and destination must match, so we have
> + * to use a bounce buffer if we are going to write to the
> + * target now. */
> +bounce_buf = qemu_blockalign(bs, bytes);
> +iov_to_buf_full(qiov->iov, qiov->niov, 0, bounce_buf, bytes);

Quorum doesn't use a bounce buffer, so I think we can get away without it too: a
guest concurrently modifying the buffer isn't a concern in practice.

> +
> +qemu_iovec_init(&bounce_qiov, 1);
> +qemu_iovec_add(&bounce_qiov, bounce_buf, bytes);
> +qiov = &bounce_qiov;
> +
> +op = active_write_prepare(s->job, offset, bytes);
> +}
> +
> +ret = bdrv_co_pwritev(bs->backing, offset, bytes, qiov, flags);
> +if (ret < 0) {
> +goto out;
> +}
> +
> +if (copy_to_target) {
> +do_sync_target_write(s->job, offset, bytes, qiov, flags);
> +}
> +
> +out:
> +if (copy_to_target) {
> +active_write_settle(op);
> +
> +qemu_iovec_destroy(&bounce_qiov);
> +qemu_vfree(bounce_buf);
> +}
> +return ret;
>  }
>  
>  static int coroutine_fn bdrv_mirror_top_flush(BlockDriverState *bs)

Don't you need to update bdrv_mirror_top_pwrite_zeroes and bdrv_mirror_top_pdiscard
too?

Fam
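
For intuition, the write-blocking path being discussed boils down to: snapshot the guest data into a bounce buffer, write it to the source, and then mirror the same bytes to the target before the request completes, so source and target stay identical even if the guest touches its buffer concurrently. A toy model, not the QEMU API (the 64-byte bounce limit is an assumption of the sketch):

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

/* Toy model of a write-blocking mirror write; not QEMU code. */
static void guest_write(uint8_t *source, uint8_t *target, size_t offset,
                        const uint8_t *data, size_t len, bool write_blocking)
{
    uint8_t bounce[64];           /* sketch assumes len <= 64 */
    memcpy(bounce, data, len);    /* bounce copy of the guest data */
    memcpy(source + offset, bounce, len);
    if (write_blocking) {
        /* active mirroring: copy to the target before completion */
        memcpy(target + offset, bounce, len);
    }
}
```

Whether the bounce copy is actually needed is exactly the question raised above: quorum skips it on the grounds that concurrent guest modification is not a concern in practice.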



Re: [Qemu-block] [PATCH v2 12/16] block/mirror: Distinguish active from passive ops

2018-02-27 Thread Fam Zheng
On Mon, 01/22 23:08, Max Reitz wrote:
> Currently, the mirror block job only knows passive operations.  But once
> we introduce active writes, we need to distinguish between the two; for
> example, mirror_wait_for_free_in_flight_slot() should wait for a passive
> operation because active writes will not use the same in-flight slots.
> 
> Signed-off-by: Max Reitz 
> ---
>  block/mirror.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/block/mirror.c b/block/mirror.c
> index 2363e79563..bb46f3c4e9 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -89,6 +89,7 @@ struct MirrorOp {
>  int64_t *bytes_handled;
>  
>  bool is_pseudo_op;
> +bool is_active_write;
>  CoQueue waiting_requests;
>  
>  QTAILQ_ENTRY(MirrorOp) next;
> @@ -281,8 +282,10 @@ static inline void 
> mirror_wait_for_free_in_flight_slot(MirrorBlockJob *s)
>   * some other operation to start, which may in fact be the
>   * caller of this function.  Since there is only one pseudo op
>   * at any given time, we will always find some real operation
> - * to wait on. */
> -if (!op->is_pseudo_op) {
> + * to wait on.
> + * Also, only non-active operations use up in-flight slots, so
> + * we can ignore active operations. */
> +if (!op->is_pseudo_op && !op->is_active_write) {
>  qemu_co_queue_wait(&op->waiting_requests, NULL);
>  return;
>  }
> -- 
> 2.14.3
> 

I'd just squash this patch into 14 to avoid code churn.

Fam




Re: [Qemu-block] [PATCH v2 11/16] block/dirty-bitmap: Add bdrv_dirty_iter_next_area

2018-02-27 Thread Fam Zheng
On Mon, 01/22 23:08, Max Reitz wrote:
> This new function allows looking for a consecutively dirty area in a
> dirty bitmap.
> 
> Signed-off-by: Max Reitz 
> ---
>  include/block/dirty-bitmap.h |  2 ++
>  block/dirty-bitmap.c | 51 
> 
>  2 files changed, 53 insertions(+)
> 
> diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
> index a591c27213..35f3ccc44c 100644
> --- a/include/block/dirty-bitmap.h
> +++ b/include/block/dirty-bitmap.h
> @@ -79,6 +79,8 @@ void bdrv_set_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
>  void bdrv_reset_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
>  int64_t offset, int64_t bytes);
>  int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter);
> +bool bdrv_dirty_iter_next_area(BdrvDirtyBitmapIter *iter, uint64_t 
> max_offset,
> +   uint64_t *offset, int *bytes);
>  void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *hbi, int64_t offset);
>  int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap);
>  int64_t bdrv_get_meta_dirty_count(BdrvDirtyBitmap *bitmap);
> diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
> index 50564fa1e2..484b5dda43 100644
> --- a/block/dirty-bitmap.c
> +++ b/block/dirty-bitmap.c
> @@ -501,6 +501,57 @@ int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter)
>  return hbitmap_iter_next(&iter->hbi, true);
>  }
>  
> +/**
> + * Return the next consecutively dirty area in the dirty bitmap
> + * belonging to the given iterator @iter.
> + *
> + * @max_offset: Maximum value that may be returned for
> + *  *offset + *bytes
> + * @offset: Will contain the start offset of the next dirty area
> + * @bytes:  Will contain the length of the next dirty area
> + *
> + * Returns: True if a dirty area could be found before max_offset
> + *  (which means that *offset and *bytes then contain valid
> + *  values), false otherwise.

Also document the change to the iter cursor depending on the return value?

Fam
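
The behaviour being documented -- scan from the iterator position for the next run of consecutive dirty bits, truncated so the run ends at or before max_offset -- can be modelled roughly like this (plain C over a bool array, not the hbitmap API):

```c
#include <stdint.h>
#include <stdbool.h>

/* Rough model of bdrv_dirty_iter_next_area's contract; illustrative only. */
static bool next_dirty_area(const bool *bits, uint64_t pos,
                            uint64_t max_offset,
                            uint64_t *offset, int *bytes)
{
    while (pos < max_offset && !bits[pos]) {
        pos++;                        /* skip the clean prefix */
    }
    if (pos >= max_offset) {
        return false;                 /* no dirty bit below max_offset */
    }
    *offset = pos;
    while (pos < max_offset && bits[pos]) {
        pos++;                        /* extend over the dirty run */
    }
    *bytes = (int)(pos - *offset);
    return true;
}
```

Fam's point stands either way: what the real function does to the iterator's cursor in the true and false cases should be spelled out in the doc comment.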



Re: [Qemu-block] [PATCH v2 10/16] hbitmap: Add @advance param to hbitmap_iter_next()

2018-02-27 Thread Fam Zheng
On Mon, 01/22 23:08, Max Reitz wrote:
> diff --git a/tests/test-hbitmap.c b/tests/test-hbitmap.c
> index 9091c639b3..2a2aa5bd43 100644
> --- a/tests/test-hbitmap.c
> +++ b/tests/test-hbitmap.c
> @@ -46,7 +46,7 @@ static void hbitmap_test_check(TestHBitmapData *data,
>  
>  i = first;
>  for (;;) {
> -next = hbitmap_iter_next(&hbi);
> +next = hbitmap_iter_next(&hbi, true);
>  if (next < 0) {
>  next = data->size;
>  }
> @@ -435,25 +435,25 @@ static void 
> test_hbitmap_iter_granularity(TestHBitmapData *data,
>  /* Note that hbitmap_test_check has to be invoked manually in this test. 
>  */
>  hbitmap_test_init(data, 131072 << 7, 7);
>  hbitmap_iter_init(&hbi, data->hb, 0);
> -g_assert_cmpint(hbitmap_iter_next(&hbi), <, 0);
> +g_assert_cmpint(hbitmap_iter_next(&hbi, true), <, 0);
>  
>  hbitmap_test_set(data, ((L2 + L1 + 1) << 7) + 8, 8);
>  hbitmap_iter_init(&hbi, data->hb, 0);
> -g_assert_cmpint(hbitmap_iter_next(&hbi), ==, (L2 + L1 + 1) << 7);
> -g_assert_cmpint(hbitmap_iter_next(&hbi), <, 0);
> +g_assert_cmpint(hbitmap_iter_next(&hbi, true), ==, (L2 + L1 + 1) << 7);
> +g_assert_cmpint(hbitmap_iter_next(&hbi, true), <, 0);
>  
>  hbitmap_iter_init(&hbi, data->hb, (L2 + L1 + 2) << 7);
> -g_assert_cmpint(hbitmap_iter_next(&hbi), <, 0);
> +g_assert_cmpint(hbitmap_iter_next(&hbi, true), <, 0);
>  
>  hbitmap_test_set(data, (131072 << 7) - 8, 8);
>  hbitmap_iter_init(&hbi, data->hb, 0);
> -g_assert_cmpint(hbitmap_iter_next(&hbi), ==, (L2 + L1 + 1) << 7);
> -g_assert_cmpint(hbitmap_iter_next(&hbi), ==, 131071 << 7);
> -g_assert_cmpint(hbitmap_iter_next(&hbi), <, 0);
> +g_assert_cmpint(hbitmap_iter_next(&hbi, true), ==, (L2 + L1 + 1) << 7);
> +g_assert_cmpint(hbitmap_iter_next(&hbi, true), ==, 131071 << 7);
> +g_assert_cmpint(hbitmap_iter_next(&hbi, true), <, 0);
>  
>  hbitmap_iter_init(&hbi, data->hb, (L2 + L1 + 2) << 7);
> -g_assert_cmpint(hbitmap_iter_next(&hbi), ==, 131071 << 7);
> -g_assert_cmpint(hbitmap_iter_next(&hbi), <, 0);
> +g_assert_cmpint(hbitmap_iter_next(&hbi, true), ==, 131071 << 7);
> +g_assert_cmpint(hbitmap_iter_next(&hbi, true), <, 0);
>  }

Please add tests for advance=false too.
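
A toy model of what such a test would pin down -- with advance=false the cursor stays put, so the same bit is returned again, while advance=true moves past it (illustrative only, not the hbitmap API):

```c
#include <stdbool.h>

/* Minimal bit iterator sketching the advance=true/false semantics. */
typedef struct {
    const bool *bits;
    int n;
    int pos;
} BitIter;

static int bit_iter_next(BitIter *it, bool advance)
{
    for (int i = it->pos; i < it->n; i++) {
        if (it->bits[i]) {
            /* advance=true: move past the bit; false: stay on it */
            it->pos = advance ? i + 1 : i;
            return i;
        }
    }
    return -1;
}
```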



Re: [Qemu-block] [PATCH v2 06/16] block/mirror: Use CoQueue to wait on in-flight ops

2018-02-27 Thread Fam Zheng
On Mon, 01/22 23:07, Max Reitz wrote:
>  qemu_iovec_destroy(&op->qiov);
> -g_free(op);
>  
> -if (s->waiting_for_io) {
> -qemu_coroutine_enter(s->common.co);
> -}
> +qemu_co_queue_restart_all(&op->waiting_requests);
> +g_free(op);

OK, this answers my question to patch 5.

Fam
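
The shape of the change can be modelled like this (toy callbacks standing in for coroutines; not the QEMU coroutine API, and the fixed-size waiter list is an assumption of the sketch): instead of one waiting_for_io flag that wakes a single well-known coroutine, each in-flight op carries its own waiter queue, and completing the op restarts all of them.

```c
/* Toy model of per-op waiter queues, mimicking qemu_co_queue_restart_all. */
typedef void (*waiter_cb)(void *opaque);

typedef struct {
    waiter_cb waiters[4];   /* fixed-size list, sketch only */
    void *opaques[4];
    int n_waiters;
    int done;
} Op;

static void op_wait(Op *op, waiter_cb cb, void *opaque)
{
    if (op->done) {
        cb(opaque);                       /* already finished: run now */
    } else {
        op->waiters[op->n_waiters] = cb;
        op->opaques[op->n_waiters] = opaque;
        op->n_waiters++;
    }
}

static void op_complete(Op *op)
{
    op->done = 1;
    for (int i = 0; i < op->n_waiters; i++) {
        op->waiters[i](op->opaques[i]);   /* restart every waiter */
    }
    op->n_waiters = 0;
}
```

This also makes the ordering in the patch hunk above matter: the waiters have to be woken before the op is freed.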