Re: [Qemu-block] [Qemu-devel] [PATCH] qemu-img: Document --force-share / -U

2017-12-07 Thread Fam Zheng
On Fri, 12/08 08:42, Markus Armbruster wrote:
> Fam Zheng  writes:
> 
> > On Thu, 12/07 10:53, Eric Blake wrote:
> >> On 12/07/2017 04:58 AM, Kashyap Chamarthy wrote:
> >> > On Thu, Dec 07, 2017 at 04:44:53PM +0800, Fam Zheng wrote:
> [...]
> >> >> +it less likely to conflict with a running guest's permissions due to image
> >> >> +locking. For example, this can be used to get the image information (with
> >> >> +'info' subcommand) when the image is used by a running guest. Note that this
> >> >> +could produce inconsistent result because of concurrent metadata changes, etc..
> >> > 
> >> > Super nit-pick:  an ellipsis[*] is three dots :-), so, when applying you
> >> > might want to: s/../.../
> >> > 
> >> > [*] https://dictionary.cambridge.org/dictionary/english/ellipsis
> >> 
> >> Except that both "etc." and "..." independently convey a sense of
> >> continuation, which means that using both at once is both redundant
> >> (just one will do) and difficult to argue how to typeset (since 'etc.'
> >> is often written with an explicit '.' to emphasize that is an
> >> abbreviation, does that mean you have to write 'etc.''...' for a total
> >> of 4 dots?).
> >
> > I have the impression that "etc." is more correct than "etc"
> 
> It is.
> 
> > so I used it even at the end of the sentence where there is another
> > period '.', making it "etc..".
> 
> That's wrong all the same :)
> 
> https://en.wikipedia.org/wiki/Period_(punctuation)#Abbreviations
> 
> > If ending the paragraph with "etc." is enough, we can drop one ".".
> 
> Please do.

Yes, thanks, v2 sent.

Fam



Re: [Qemu-block] [Qemu-devel] [PATCH] qemu-img: Document --force-share / -U

2017-12-07 Thread Markus Armbruster
Fam Zheng  writes:

> On Thu, 12/07 10:53, Eric Blake wrote:
>> On 12/07/2017 04:58 AM, Kashyap Chamarthy wrote:
>> > On Thu, Dec 07, 2017 at 04:44:53PM +0800, Fam Zheng wrote:
[...]
>> >> +it less likely to conflict with a running guest's permissions due to image
>> >> +locking. For example, this can be used to get the image information (with
>> >> +'info' subcommand) when the image is used by a running guest. Note that this
>> >> +could produce inconsistent result because of concurrent metadata changes, etc..
>> > 
>> > Super nit-pick:  an ellipsis[*] is three dots :-), so, when applying you
>> > might want to: s/../.../
>> > 
>> > [*] https://dictionary.cambridge.org/dictionary/english/ellipsis
>> 
>> Except that both "etc." and "..." independently convey a sense of
>> continuation, which means that using both at once is both redundant
>> (just one will do) and difficult to argue how to typeset (since 'etc.'
>> is often written with an explicit '.' to emphasize that is an
>> abbreviation, does that mean you have to write 'etc.''...' for a total
>> of 4 dots?).
>
> I have the impression that "etc." is more correct than "etc"

It is.

> so I used it even at the end of the sentence where there is another
> period '.', making it "etc..".

That's wrong all the same :)

https://en.wikipedia.org/wiki/Period_(punctuation)#Abbreviations

> If ending the paragraph with "etc." is enough, we can drop one ".".

Please do.



Re: [Qemu-block] import thin provisioned disks with image upload

2017-12-07 Thread Gianluca Cecchi
On Thu, Dec 7, 2017 at 9:33 PM, Nir Soffer  wrote:

>
> For transferring images over http, there is no way to avoid sending
> unused blocks, except compression.
>

OK, understood


>> Maybe this file was created with preallocation=full?
>>>
>>
>> In virt-manager there is no such logic by default.
>> The default format is qcow2/sparse
>> When you create/add a disk you can choose "Select or create custom
>> storage":
>> https://drive.google.com/file/d/1gAwjAG5aRFC5fFgkoZXT5wINvQnov88p/view?usp=sharing
>>
>> And then choose a format between:
>> raw, qcow, qcow2, qed, vmdk, vpc, vdi
>>
>> In my case it was the default one, so qcow2, as qemu-img correctly
>> detects.
>>
>
> I tested creating qcow2 files with virt manager, and it seems to use
> the preallocation=metadata by default (behind your back), I guess
> for improving performance in some cases. oVirt does not use this
> option so we don't have this issue when downloading files.
>
>
Ah, ok. I didn't know about this preallocation option of qemu-img.
Does your example mean that by default oVirt uses preallocation=off when
creating thin provisioned disks?


> I hope I convinced you this time ;-)
>
> Nir
>

Yes :-] and thanks for the deep and detailed explanation, Nir

Thanks also to Eric for his comments in the other reply

Gianluca


[Qemu-block] [PATCH v2] qemu-img: Document --force-share / -U

2017-12-07 Thread Fam Zheng
Signed-off-by: Fam Zheng 

---
v2: - "@code{qemu-img}". [Kashyap, Eric]
- "etc.." -> "etc.".
---
 qemu-img.texi | 9 +
 1 file changed, 9 insertions(+)

diff --git a/qemu-img.texi b/qemu-img.texi
index fdcf120f36..6b9f9adfc8 100644
--- a/qemu-img.texi
+++ b/qemu-img.texi
@@ -57,6 +57,15 @@ exclusive with the @var{-O} parameters. It is currently required to also use
 the @var{-n} parameter to skip image creation. This restriction may be relaxed
 in a future release.
 
+@item --force-share (-U)
+
+If specified, @code{qemu-img} will open the image with shared permissions,
+which makes it less likely to conflict with a running guest's permissions due
+to image locking. For example, this can be used to get the image information
+(with 'info' subcommand) when the image is used by a running guest. Note that
+this could produce inconsistent result because of concurrent metadata changes,
+etc.
+
 @item fmt
 is the disk image format. It is guessed automatically in most cases. See below
 for a description of the supported disk formats.
-- 
2.14.3
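As an archive-side illustration (not part of the patch): the documented
-U flag simply composes with other qemu-img options. A minimal Python
sketch follows; the helper names are hypothetical, and only `qemu-img
info`, `-U`/`--force-share`, and `--output=json` come from qemu-img itself:

```python
import json
import subprocess

def build_info_cmd(image, force_share=True):
    """Build a 'qemu-img info' command line; -U requests shared
    (non-exclusive) locking so a running guest is not disturbed."""
    cmd = ["qemu-img", "info", "--output=json", image]
    if force_share:
        cmd.insert(2, "-U")  # equivalent to --force-share
    return cmd

def image_info(image):
    # As the documentation warns, the result may be inconsistent if
    # the guest is changing metadata concurrently.
    out = subprocess.check_output(build_info_cmd(image))
    return json.loads(out)

cmd = build_info_cmd("guest-disk.qcow2")
```

Because locking is bypassed, any info read this way should be treated as
a possibly stale snapshot of the image metadata.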




Re: [Qemu-block] [Qemu-devel] [PATCH] qemu-img: Document --force-share / -U

2017-12-07 Thread Fam Zheng
On Thu, 12/07 10:53, Eric Blake wrote:
> On 12/07/2017 04:58 AM, Kashyap Chamarthy wrote:
> > On Thu, Dec 07, 2017 at 04:44:53PM +0800, Fam Zheng wrote:
> >> Signed-off-by: Fam Zheng 
> >> ---
> >>  qemu-img.texi | 8 
> >>  1 file changed, 8 insertions(+)
> >>
> >> diff --git a/qemu-img.texi b/qemu-img.texi
> >> index fdcf120f36..3aa63aad55 100644
> >> --- a/qemu-img.texi
> >> +++ b/qemu-img.texi
> >> @@ -57,6 +57,14 @@ exclusive with the @var{-O} parameters. It is currently required to also use
> >>  the @var{-n} parameter to skip image creation. This restriction may be relaxed
> >>  in a future release.
> >>  
> >> +@item --force-share (-U)
> >> +
> >> +If specified, qemu-img will open the image with shared permissions, which makes
> > 
> > Does 'texi' require quoting tools like `qemu-img` (or single quotes),
> > to highlight them in documentation?
> 
> Our usual markup appears to be:
> 
> @code{qemu-img}

Sounds good.

> 
> > 
> >> +it less likely to conflict with a running guest's permissions due to image
> >> +locking. For example, this can be used to get the image information (with
> >> +'info' subcommand) when the image is used by a running guest. Note that this
> >> +could produce inconsistent result because of concurrent metadata changes, etc..
> > 
> > Super nit-pick:  an ellipsis[*] is three dots :-), so, when applying you
> > might want to: s/../.../
> > 
> > [*] https://dictionary.cambridge.org/dictionary/english/ellipsis
> 
> Except that both "etc." and "..." independently convey a sense of
> continuation, which means that using both at once is both redundant
> (just one will do) and difficult to argue how to typeset (since 'etc.'
> is often written with an explicit '.' to emphasize that is an
> abbreviation, does that mean you have to write 'etc.''...' for a total
> of 4 dots?).

I have the impression that "etc." is more correct than "etc", so I used it
even at the end of the sentence where there is another period '.', making
it "etc..".

If ending the paragraph with "etc." is enough, we can drop one ".".

Fam



Re: [Qemu-block] import thin provisioned disks with image upload

2017-12-07 Thread Nir Soffer
On Fri, Dec 8, 2017 at 1:35 AM Gianluca Cecchi 
wrote:

> On Thu, Dec 7, 2017 at 9:33 PM, Nir Soffer  wrote:
>
>>
>> For transferring images over http, there is no way to avoid sending
>> unused blocks, except compression.
>>
>
> OK, understood
>
>
>>> Maybe this file was created with preallocation=full?

>>>
>>> In virt-manager there is no such logic by default.
>>> The default format is qcow2/sparse
>>> When you create/add a disk you can choose "Select or create custom
>>> storage":
>>>
>>> https://drive.google.com/file/d/1gAwjAG5aRFC5fFgkoZXT5wINvQnov88p/view?usp=sharing
>>>
>>> And then choose a format between:
>>> raw, qcow, qcow2, qed, vmdk, vpc, vdi
>>>
>>> In my case it was the default one, so qcow2, as qemu-img correctly
>>> detects.
>>>
>>
>> I tested creating qcow2 files with virt manager, and it seems to use
>> the preallocation=metadata by default (behind your back), I guess
>> for improving performance in some cases. oVirt does not use this
>> option so we don't have this issue when downloading files.
>>
>>
> Ah, ok. I didn't know about this preallocation option of qemu-img.
> Does your example mean that by default oVirt uses preallocation=off when
> creating thin provisioned disks?
>

Correct, we use the default.


>
>
>> I hope I convinced you this time ;-)
>>
>> Nir
>>
>
> Yes :-] and thanks for the deep and detailed explanation, Nir
>
> Thanks also to Eric for his comments in the other reply
>
> Gianluca
>
>


Re: [Qemu-block] [Qemu-devel] [PATCH for-2.12 0/4] qmp dirty bitmap API

2017-12-07 Thread John Snow


On 12/07/2017 06:56 AM, Kashyap Chamarthy wrote:
> On Tue, Nov 21, 2017 at 07:10:19PM -0500, John Snow wrote:
>> On 11/21/2017 12:23 PM, Kevin Wolf wrote:
> 
> [...] # Snip lot of good discussion.
> 
>>> On another note, I would double check before adding a new block job type
>>> that this is the right way to go. We have recently noticed that many, if
>>> not all, of the existing block jobs (at least mirror, commit and backup)
>>> are so similar that they implement the same things multiple times and
>>> are just lacking different options and have different bugs. I'm
>>> seriously considering merging all of them into a single job type
>>> internally that just provides options that effectively turn it into one
>>> of the existing job types.
>>>
>>
>> I'm not particularly opposed. At the very, very least "backup" and
>> "mirror" are pretty much the same thing and "stream" and "commit" are
>> basically the same.
>>
>> Forcing the backuppy-job and the consolidatey-job together seems like an
>> ever-so-slightly harder case to make, but I suppose the truth of the
>> matter in all cases is that we're copying data from one node to another...
> 
> So from the above interesting discussion, it seems like Kevin is leaning
> towards a single job type that offers 'stream', 'commit', 'backup', and
> 'mirror' functionality as part of a single command / job type.  Based on
> an instinct, this sounds a bit too stuffy and complex to me.
> 
> And John seems to be leaning towards two block device job types:
> 
>   - 'blockdev-foo' that offers both current 'stream' and 'commit'
> functionality as two different options to the same QMP command; and
> 
>   - 'blockdev-bar' will offer both 'mirror' and 'backup' functionality
> as part of the same QMP command
> 
> FWIW, this seems a bit more palatable, as it is unifying
> similar-functionality-that-differ-slightly into two distinct commands.
> 
> (/me is still wrapping his head around the bitmaps and incremental
> backups infrastructure.)
> 

The discussion you're homing in on is tangential to the one at the heart
of this topic, which is:

"What's the right API for pull model incremental backups?" which digs up
the question "What is the right API for our existing data copying commands?"

Kevin is probably right in the end, though; he usually is. We do need to
unify our data moving logic to avoid fixing the same bugs in four
different but nearly identical jobs.

My only concern is that the special casing will grow too complex for
that one unified job and the job will become so complicated nobody knows
how to work on it anymore without breaking other cases. Maybe this is a
reasonable concern, maybe it isn't.

I'm also not in a hurry to personally unify the loops myself, but if
someone did, they could CC me and I'd review it thoroughly.

We'd find out in a hurry from the implementation if unification was a
win for SLOC.

--js



Re: [Qemu-block] [PATCH v2 6/6] qemu-iotests: add 203 savevm with IOThreads test

2017-12-07 Thread Eric Blake
On 12/07/2017 02:13 PM, Stefan Hajnoczi wrote:
> This test case will prevent future regressions with savevm and
> IOThreads.
> 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  tests/qemu-iotests/203 | 59 
> ++
>  tests/qemu-iotests/203.out |  6 +
>  tests/qemu-iotests/group   |  1 +
>  3 files changed, 66 insertions(+)
>  create mode 100755 tests/qemu-iotests/203
>  create mode 100644 tests/qemu-iotests/203.out
> 

> +#
> +# Creator/Owner: Stefan Hajnoczi 

Again, not sure we need this line.

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [ovirt-users] slow performance with export storage on glusterfs

2017-12-07 Thread Nir Soffer
On Thu, Nov 23, 2017 at 3:33 PM Yaniv Kaul  wrote:

> On Thu, Nov 23, 2017 at 1:43 PM, Jiří Sléžka  wrote:
>
>> well, another idea
>>
>> when I did not use the direct flag, the performance was much better
>>
>> 15787360256 bytes (16 GB) copied, 422.955159 s, 37.3 MB/s
>>
>
> That means you were hitting the cache.
>
>
>>
>> probably qemu-img uses direct writes too, and I understand why. But in
>> the case of backup it is not as critical, I think. Is there a chance to
>> modify this behavior for the backup case? Is it a good idea? Should I
>> file an RFE?
>>
>
> Probably not. We really prefer direct IO to ensure data is consistent.
> Y
>

I did some research in prehistoric oVirt source, and found why we started
to use direct I/O. We have two issues:

1. reading stale data from storage
2. trashing host cache

If you don't use direct I/O when accessing shared storage, you risk reading
stale data from the kernel buffer cache. This cache may be stale since the
kernel does not know anything about other hosts writing to the same storage
after the last read from this storage.
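As a sketch of what bypassing the buffer cache looks like from userspace
(illustrative only; O_DIRECT is Linux-specific, requires aligned buffers,
and some filesystems such as tmpfs reject it, so the helper below falls
back gracefully):

```python
import mmap
import os
import tempfile

def read_direct(path, length=4096):
    """Read the first block of a file, bypassing the page cache where
    the platform and filesystem allow it; return None if direct I/O
    is not supported on this path."""
    flags = os.O_RDONLY | getattr(os, "O_DIRECT", 0)
    try:
        fd = os.open(path, flags)
    except OSError:
        return None  # filesystem does not support direct I/O
    try:
        # O_DIRECT requires an aligned buffer; anonymous mmap memory is
        # page-aligned, which satisfies the usual 512/4096-byte rules.
        buf = mmap.mmap(-1, length)
        n = os.readv(fd, [buf])
        return bytes(buf[:n])
    finally:
        os.close(fd)

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 4096)
    name = f.name
data = read_direct(name)
os.remove(name)
```

A read done this way always hits the storage rather than a possibly stale
cached copy, which is the property the shared-storage discussion above is
about.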

The -t none option in vdsm was introduced because of
https://bugzilla.redhat.com/699976.

The qemu bug https://bugzilla.redhat.com/713743 explains the issue:
qemu-img was writing disk images using writeback, filling up the cache
buffers, which were then flushed by the kernel, preventing other processes
from accessing the storage. This is particularly bad in cluster
environments where time-based algorithms might be in place and accessing
the storage within certain timeouts is critical.

I'm not sure if this issue is relevant now. We now use sanlock instead of
safelease (except for the export domain, which still uses safelease), and
qemu or the kernel may have better options to avoid trashing the host
cache, or to guarantee reliable access to storage.

David, do you know if sanlock is affected by trashing the host cache?

Adding also qemu-block mailing list.

Nir


>
>
>>
>> Cheers,
>>
>> Jiri
>>
>>
>> On 11/23/2017 12:26 PM, Jiří Sléžka wrote:
>> > Hi,
>> >
>> > On 11/22/2017 07:30 PM, Nir Soffer wrote:
>> >> On Mon, Nov 20, 2017 at 5:22 PM Jiří Sléžka > >> > wrote:
>> >>
>> >> Hi,
>> >>
>> >> I am trying to understand why exporting a VM to export storage on
>> >> glusterfs is so slow.
>> >>
>> >> I am using oVirt and RHV, both installations on version 4.1.7.
>> >>
>> >> Hosts have dedicated nics for rhevm network - 1gbps, data storage
>> itself
>> >> is on FC.
>> >>
>> >> The GlusterFS cluster lives separately on 4 dedicated hosts. It has
>> >> slow disks but I can achieve about 200-400mbit throughput in other
>> >> applications (we are using it for "cold" data, backups mostly).
>> >>
>> >> I am using this glusterfs cluster as backend for export storage. When
>> >> I am exporting a vm I can see only about 60-80mbit throughput.
>> >>
>> >> What could be the bottleneck here?
>> >>
>> >> Could it be qemu-img utility?
>> >>
>> >> vdsm  97739  0.3  0.0 354212 29148 ?S  /usr/bin/qemu-img convert -p -t none -T none -f raw /rhev/data-center/2ff6d0ee-a10b-473d-b77c-be9149945f5f/ff3cd56a-1005-4426-8137-8f422c0b47c1/images/ba42cbcc-c068-4df8-af3d-00f2077b1e27/c57acd5f-d6cf-48cc-ad0c-4a7d979c0c1e -O raw /rhev/data-center/mnt/glusterSD/10.20.30.41:_rhv__export/81094499-a392-4ea2-b081-7c6288fbb636/images/ba42cbcc-c068-4df8-af3d-00f2077b1e27/c57acd5f-d6cf-48cc-ad0c-4a7d979c0c1e
>> >>
>> >> Any idea how to make it work faster, or what throughput I should
>> >> expect?
>> >>
>> >>
>> >> gluster storage operations are using fuse mount - so every write:
>> >> - travel to the kernel
>> >> - travel back to the gluster fuse helper process
>> >> - travel to all 3 replicas - replication is done on client side
>> >> - return to kernel when all writes succeeded
>> >> - return to caller
>> >>
>> >> So gluster will never set any speed record.
>> >>
>> >> Additionally, you are copying from a raw LV on FC - qemu-img cannot
>> >> do anything smart to avoid copying unused clusters. Instead it copies
>> >> gigabytes of zeros from FC.
>> >
>> > ok, it does make sense
>> >
>> >> However 7.5-10 MiB/s sounds too slow.
>> >>
>> >> I would try to test with dd - how much time it takes to copy
>> >> the same image from FC to your gluster storage?
>> >>
>> >> dd if=/rhev/data-center/2ff6d0ee-a10b-473d-b77c-be9149945f5f/ff3cd56a-1005-4426-8137-8f422c0b47c1/images/ba42cbcc-c068-4df8-af3d-00f2077b1e27/c57acd5f-d6cf-48cc-ad0c-4a7d979c0c1e of=/rhev/data-center/mnt/glusterSD/10.20.30.41:_rhv__export/81094499-a392-4ea2-b081-7c6288fbb636/__test__ bs=8M oflag=direct status=progress
>> >
>> > unfortunately dd performs the same
>> >
>> > 1778384896 bytes (1.8 GB) copied, 198.565265 s, 9.0 MB/s
>> >
>> >
>> >> If dd can do this faster, please ask on qemu-discuss mailing list:
>> >> 

Re: [Qemu-block] [PATCH v2 5/6] iothread: fix iothread_stop() race condition

2017-12-07 Thread Eric Blake
On 12/07/2017 02:13 PM, Stefan Hajnoczi wrote:
> There is a small chance that iothread_stop() hangs as follows:
> 

> 
> The bug is explained by the AioContext->notify_me doc comments:
> 
>   "If this field is 0, everything (file descriptors, bottom halves,
>   timers) will be re-evaluated before the next blocking poll(), thus the
>   event_notifier_set call can be skipped."
> 
> The problem is that "everything" does not include checking
> iothread->stopping.  This means iothread_run() will block in aio_poll()
> if aio_notify() was called just before aio_poll().
> 
> This patch fixes the hang by replacing aio_notify() with
> aio_bh_schedule_oneshot().  This makes aio_poll() or g_main_loop_run()
> to return.

s/to //

> 
> Implementing this properly required a new bool running flag.  The new
> flag prevents races that are tricky if we try to use iothread->stopping.
> Now iothread->stopping is purely for iothread_stop() and
> iothread->running is purely for the iothread_run() thread.
> 
> Signed-off-by: Stefan Hajnoczi 
> ---
> I'm including this patch in this series because it is needed to make the
> test case reliable.  It's unrelated to the main goal of the patch
> series.
> ---
Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
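The shape of the fix discussed above — stopping the loop by scheduling a
one-shot callback so the blocked thread is guaranteed to wake up and
re-check its flag — can be sketched with ordinary Python threading. This
is only an analogy to iothread_run()/aio_bh_schedule_oneshot(), not QEMU
code:

```python
import queue
import threading

class MiniIOThread:
    """Toy event loop: blocks on a queue of callbacks the way
    iothread_run() blocks in aio_poll()."""
    def __init__(self):
        self.running = True
        self._work = queue.Queue()
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def _run(self):
        while self.running:
            cb = self._work.get()  # "aio_poll": blocks until work arrives
            cb()

    def schedule_oneshot(self, cb):
        self._work.put(cb)

    def stop(self):
        # A bare 'self.running = False' could race: the loop may already
        # be committed to blocking and never re-check the flag.  Posting
        # a callback both delivers the flag change and wakes the loop.
        def stop_cb():
            self.running = False
        self.schedule_oneshot(stop_cb)
        self._thread.join()

t = MiniIOThread()
t.stop()
```

The queue's blocking get() plays the role of the poll; the scheduled
callback guarantees one more loop iteration, so join() cannot hang.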





Re: [Qemu-block] [PATCH v2 4/6] iotests: add VM.add_object()

2017-12-07 Thread Eric Blake
On 12/07/2017 02:13 PM, Stefan Hajnoczi wrote:
> The VM.add_object() method can be used to add IOThreads or memory
> backend objects.
> 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  tests/qemu-iotests/iotests.py | 5 +
>  1 file changed, 5 insertions(+)
> 

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] [PATCH v2 3/6] blockdev: add x-blockdev-set-iothread force boolean

2017-12-07 Thread Eric Blake
On 12/07/2017 02:13 PM, Stefan Hajnoczi wrote:
> When a node is already associated with a BlockBackend the
> x-blockdev-set-iothread command refuses to set the IOThread.  This is to
> prevent accidentally changing the IOThread when the nodes are in use.
> 
> When the nodes are created with -drive they automatically get a
> BlockBackend.  In that case we know nothing is using them yet and it's
> safe to set the IOThread.  Add a force boolean to override the check.
> 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  qapi/block-core.json |  6 +-
>  blockdev.c   | 11 ++-
>  2 files changed, 11 insertions(+), 6 deletions(-)
> 

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] [PATCH v2 2/6] docs: mark nested AioContext locking as a legacy API

2017-12-07 Thread Eric Blake
On 12/07/2017 02:13 PM, Stefan Hajnoczi wrote:
> See the patch for why nested AioContext locking is no longer allowed.
> 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  docs/devel/multiple-iothreads.txt | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] [ovirt-users] Scheduling daily Snapshot

2017-12-07 Thread Nir Soffer
On Wed, Dec 6, 2017 at 6:02 PM Jason Lelievre 
wrote:

> Hello,
>
> What is the best way to set up a daily live snapshot for all VM, and have
> the possibility to recover, for example, a specific VM to a specific day?
>

Each snapshot you create makes reads and writes slower, as qemu has to
look up data through the entire chain.

When we take a snapshot, we create a new file (or block device) and make
the new file the active layer of the chain.

For example, assuming you have a base image in raw format:

image-1.raw (top)

After taking a snapshot, you have:

image-1.raw <- image-2.qcow2 (top)

Now when qemu needs to read data from the image, it will try to get the
data from the top layer (image-2); if it does not exist there, it will try
the backing file (image-1). Same when writing data: if qemu needs to write
a small amount of data, it may need to get an entire sector from another
layer in the chain and copy it to the top layer.

So after a month of backups, you will have 30 layers in the chain. I don't
know how much slower it will be; I never measured this.
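The read/write path described above can be modeled with a toy overlay
chain (a sketch only; real qcow2 maps clusters via on-disk tables, not
Python dicts):

```python
class Layer(dict):
    """One image in a backing chain: maps cluster index -> data."""

def chain_read(chain, cluster):
    """Read a cluster: try the top (active) layer first, then walk
    down through the backing files."""
    for layer in reversed(chain):  # chain[0] is the base, chain[-1] the top
        if cluster in layer:
            return layer[cluster]
    return b"\0"  # unallocated clusters read as zeros

def chain_write(chain, cluster, data):
    # Writes always go to the top layer (copy-on-write).
    chain[-1][cluster] = data

base = Layer({0: b"old"})
top = Layer()       # freshly taken snapshot: empty active layer
chain = [base, top]
chain_write(chain, 0, b"new")
```

Every extra layer adds one more dictionary to consult on a miss, which is
the per-snapshot read cost the paragraph above is describing.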

For backups, I think you would like to have something like a daily backup
for the last week, a weekly backup for the last month, and so on.

For a daily backup of the last week you can take a snapshot every day, and
delete the 7th snapshot every day. Then back up the snapshot below the
active layer. You can download the snapshot using the SDK; see the
download_snapshot example (requires 4.2). All this can be done while the
VM is running.

On your backup, you will have a snapshot per day. You can keep all
snapshots or merge old snapshots using qemu-img commit.
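The rotation described above, in toy form (names and structure are
illustrative only, not oVirt SDK calls):

```python
from collections import deque

MAX_SNAPSHOTS = 7  # one snapshot per day for the last week

def rotate(snapshots, today, backup):
    """snapshots: deque of snapshot names, oldest first, active last.
    Take today's snapshot, back up the one just below the new active
    layer, and drop the oldest once more than MAX_SNAPSHOTS are kept."""
    snapshots.append(today)
    if len(snapshots) >= 2:
        backup(snapshots[-2])      # the snapshot below the active layer
    while len(snapshots) > MAX_SNAPSHOTS:
        snapshots.popleft()        # cf. merging old layers with qemu-img commit
    return snapshots

backed_up = []
snaps = deque()
for day in range(10):
    rotate(snaps, f"day-{day}", backed_up.append)
```

After ten days the chain stays capped at seven layers while every
completed day has been handed to the backup function exactly once.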

It should also be possible to reconstruct an image on storage from your
backup by uploading the snapshots back to storage - but it is tricky to
get right.

I think the best option for users is to get some 3rd party solution based
on
the oVirt SDK.

Adding the qemu-block mailing list; they can probably add more info on how
much you pay for a large chain of images, and have more ideas about backing
up qcow2 chains.

> I use a Hyperconverged Infrastructure with 3 nodes, gluster storage.
>

Maybe a solution based on gluster or the layers below (e.g. lvm snapshots)
will
be easier to use and more efficient. You can ask in the gluster mailing
list.

Adding Sahina to help with the gluster side.

Nir

>
>

>
> Thank you,
> ___
> Users mailing list
> us...@ovirt.org
> http://lists.ovirt.org/mailman/listinfo/users
>


Re: [Qemu-block] [PATCH v2 1/6] block: avoid recursive AioContext acquire in bdrv_inactivate_all()

2017-12-07 Thread Eric Blake
On 12/07/2017 02:13 PM, Stefan Hajnoczi wrote:
> From: Paolo Bonzini 
> 
> BDRV_POLL_WHILE() does not support recursive AioContext locking.  It
> only releases the AioContext lock once regardless of how many times the
> caller has acquired it.  This results in a hang since the IOThread does
> not make progress while the AioContext is still locked.
> 
> The following steps trigger the hang:
> 
>   $ qemu-system-x86_64 -M accel=kvm -m 1G -cpu host \
>-object iothread,id=iothread0 \
>-device virtio-scsi-pci,iothread=iothread0 \
>-drive if=none,id=drive0,file=test.img,format=raw \
>-device scsi-hd,drive=drive0 \
>-drive if=none,id=drive1,file=test.img,format=raw \
>-device scsi-hd,drive=drive1
>   $ qemu-system-x86_64 ...same options... \
>-incoming tcp::1234
>   (qemu) migrate tcp:127.0.0.1:1234
>   ...hang...
> 
> Tested-by: Stefan Hajnoczi 
> Signed-off-by: Paolo Bonzini 
> ---
>  block.c | 14 +++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] [Qemu-devel] [PATCH v2 7/9] iothread: add iothread_by_id() API

2017-12-07 Thread Eric Blake
On 12/06/2017 08:45 AM, Stefan Hajnoczi wrote:
> Encapsulate IOThread QOM object lookup so that callers don't need to
> know how and where IOThread objects live.
> 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  include/sysemu/iothread.h | 1 +
>  iothread.c| 7 +++
>  2 files changed, 8 insertions(+)
> 

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] [Qemu-devel] [PATCH v2 9/9] qemu-iotests: add 202 external snapshots IOThread test

2017-12-07 Thread Eric Blake
On 12/06/2017 08:45 AM, Stefan Hajnoczi wrote:
> QMP 'transaction' blockdev-snapshot-sync with multiple disks in an
> IOThread is an untested code path.  Several bugs have been found in
> connection with this command.  This patch adds a test case to prevent
> future regressions.
> 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  tests/qemu-iotests/202 | 95 
> ++
>  tests/qemu-iotests/202.out | 11 ++
>  tests/qemu-iotests/group   |  1 +
>  3 files changed, 107 insertions(+)
>  create mode 100755 tests/qemu-iotests/202
>  create mode 100644 tests/qemu-iotests/202.out
> 

> +# Creator/Owner: Stefan Hajnoczi 

There's another cleanup series posted that drops these sorts of lines in
favor of git history.

> +++ b/tests/qemu-iotests/202.out
> @@ -0,0 +1,11 @@
> +Launching VM...
> +Adding IOThread...
> +{u'return': {}}
> +Adding blockdevs...
> +{u'return': {}}
> +{u'return': {}}
> +Setting iothread...
> +{u'return': {}}
> +{u'return': {}}
> +Creating external snapshots...
> +{u'return': {}}

Yay - a python test with sane output - if I had to, I could actually
debug this test without too much effort!

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] [Qemu-devel] [PATCH v2 8/9] blockdev: add x-blockdev-set-iothread testing command

2017-12-07 Thread Eric Blake
On 12/06/2017 08:45 AM, Stefan Hajnoczi wrote:
> Currently there is no easy way for iotests to ensure that a BDS is bound
> to a particular IOThread.  Normally the virtio-blk device calls
> blk_set_aio_context() when dataplane is enabled during guest driver
> initialization.  This never happens in iotests since -machine
> accel=qtest means there is no guest activity (including device driver
> initialization).
> 
> This patch adds a QMP command to explicitly assign IOThreads in test
> cases.  See qapi/block-core.json for a description of the command.

The x- prefix is perfect for this.

> 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  qapi/block-core.json | 36 
>  blockdev.c   | 41 +
>  2 files changed, 77 insertions(+)
> 

> +##
> +# @x-blockdev-set-iothread:
> +#
> +# Move @node and its children into the @iothread.  If @iothread is null then
> +# move @node and its children into the main loop.
> +#
> +# The node must not be attached to a BlockBackend.
> +#
> +# @node-name: the name of the block driver node
> +#
> +# @iothread: the name of the IOThread object or null for the main loop
> +#
> +# Note: this command is experimental and intended for test cases that need
> +# control over IOThreads only.

I'd place 'only' sooner; it fits better as 'intended only for ...'.

As a wording tweak is minor,
Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
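Based on the schema text quoted above, a QMP client would issue the
command roughly as follows (a sketch; only the command name and the
node-name/iothread arguments come from the quoted documentation, and the
node/iothread IDs are made up):

```python
import json

def set_iothread_cmd(node_name, iothread):
    """Build the QMP message: iothread is an IOThread object name, or
    None (JSON null) to move the node back to the main loop."""
    return {
        "execute": "x-blockdev-set-iothread",
        "arguments": {"node-name": node_name, "iothread": iothread},
    }

# Wire format, as it would be written to the QMP socket:
msg = json.dumps(set_iothread_cmd("drive0", "iothread0"))
back_to_main = set_iothread_cmd("drive0", None)
```

The None/null case mirrors the documented way to detach the node from an
IOThread again.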





Re: [Qemu-block] [Qemu-devel] [PATCH v2 6/9] block: drop unused BlockDirtyBitmapState->aio_context field

2017-12-07 Thread Eric Blake
On 12/06/2017 08:45 AM, Stefan Hajnoczi wrote:
> The dirty bitmap actions in qmp_transaction have not used AioContext
> since the dirty bitmap locking discipline was introduced in commit
> 2119882c7eb7e2c612b24fc0c8d86f5887d6f1c3 ("block: introduce
> dirty_bitmap_mutex").  Remove the unused field.
> 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  blockdev.c | 13 -
>  1 file changed, 13 deletions(-)

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] [Qemu-devel] [PATCH v2 5/9] block: don't keep AioContext acquired after internal_snapshot_prepare()

2017-12-07 Thread Eric Blake
On 12/06/2017 08:45 AM, Stefan Hajnoczi wrote:
> Signed-off-by: Stefan Hajnoczi 
> ---
>  blockdev.c | 47 +++
>  1 file changed, 31 insertions(+), 16 deletions(-)
> 

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] [Qemu-devel] [PATCH v2 4/9] block: don't keep AioContext acquired after blockdev_backup_prepare()

2017-12-07 Thread Eric Blake
On 12/06/2017 08:45 AM, Stefan Hajnoczi wrote:
> Signed-off-by: Stefan Hajnoczi 
> ---
>  blockdev.c | 44 ++--
>  1 file changed, 34 insertions(+), 10 deletions(-)
> 

>  state->job = do_blockdev_backup(backup, common->block_job_txn, 
> _err);
>  if (local_err) {
>  error_propagate(errp, local_err);
> -return;
> +goto out;
>  }
> +
> +out:
> +aio_context_release(aio_context);
>  }

Again a weird one-shot label where fallthrough would do the same.

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] [Qemu-devel] [PATCH v2 3/9] block: don't keep AioContext acquired after drive_backup_prepare()

2017-12-07 Thread Eric Blake
On 12/06/2017 08:45 AM, Stefan Hajnoczi wrote:
> Signed-off-by: Stefan Hajnoczi 
> ---
>  blockdev.c | 42 ++
>  1 file changed, 34 insertions(+), 8 deletions(-)
> 

> @@ -1867,24 +1867,36 @@ static void drive_backup_prepare(BlkActionState 
> *common, Error **errp)
>  return;
>  }
>  
> -/* AioContext is released in .clean() */
> -state->aio_context = bdrv_get_aio_context(bs);
> -aio_context_acquire(state->aio_context);
> +aio_context = bdrv_get_aio_context(bs);
> +aio_context_acquire(aio_context);
> +
> +/* Paired with .clean() */
>  bdrv_drained_begin(bs);
> +
>  state->bs = bs;
>  
>  state->job = do_drive_backup(backup, common->block_job_txn, &local_err);
>  if (local_err) {
>  error_propagate(errp, local_err);
> -return;
> +goto out;
>  }
> +
> +out:
> +aio_context_release(aio_context);

Looks a bit funny to have a label with a single use that would fall
through to the same location anyways.  But it's fine from the future
maintenance point of view, so no need to change it.

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] import thin provisioned disks with image upload

2017-12-07 Thread Eric Blake
On 12/07/2017 02:33 PM, Nir Soffer wrote:

> $ truncate -s 1g empty
> 
> $ stat empty
>   File: 'empty'
>   Size: 1073741824 Blocks: 0  IO Block: 4096   regular file
>   ...
> 
> $ qemu-img info empty
> image: empty
> file format: raw
> virtual size: 1.0G (1073741824 bytes)
> disk size: 0
> 
> The value "disk size" used by qemu-img is confusing and not useful
> when you want to transfer the file to another host.
> 
> I don't know why qemu-img displays this value instead of the actual
> file size, adding qemu-block mailing list in case someone can explain
> this.

For regular files, qemu is reporting the stat Blocks value (a completely
sparse file occupies 0 blocks, and therefore uses 0 bytes of the host
filesystem).
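The distinction Eric describes can be reproduced outside qemu. A quick hypothetical Python sketch (not qemu code): for a regular file, qemu-img's "disk size" corresponds to the stat Blocks value (st_blocks, in 512-byte units), while "virtual size" matches st_size.

```python
import os
import tempfile

def sparse_sizes(length=1 << 20):
    """Create a fully sparse file and return (apparent size, bytes on disk)."""
    fd, path = tempfile.mkstemp()
    try:
        os.close(fd)
        os.truncate(path, length)  # extend with a hole; no data blocks written
        st = os.stat(path)
        return st.st_size, st.st_blocks * 512
    finally:
        os.unlink(path)

apparent, on_disk = sparse_sizes()
```

On a filesystem that supports sparse files, on_disk comes back as 0 even though apparent is a full megabyte, mirroring the `truncate`/`stat`/`qemu-img info` transcript quoted above.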

> 
> When you upload or download this file, you will transfer 1g of zeros.

Well, depending on whether your transfer mechanism has optimizations for
transferring holes efficiently, you may not have to send 1G of zeroes
over the wire; but yes, the receiving end must reconstruct a file
containing 1G of zeroes (even if that reconstruction is also sparse, and
occupies 0 blocks of the file system).


>> If I go and check on my exported file system I get this
>>
>> [root@ovirt01 NFS_DOMAIN]# find . -name
>> "3c68d43f-0f28-4564-b557-d390a125daa6"
>>
>> ./572eabe7-15d0-42c2-8fa9-0bd773e22e2e/images/3c68d43f-0f28-4564-b557-d390a125daa6
>> [root@ovirt01 NFS_DOMAIN]# ls -lsh
>> ./572eabe7-15d0-42c2-8fa9-0bd773e22e2e/images/3c68d43f-0f28-4564-b557-d390a125daa6
>> total 8.6G
>> 8.6G -rw-rw. 1 vdsm kvm  10G Dec  7 08:42
>> 09ad8e53-0b22-4fe3-b718-d14352b8290a

> 
> Yes, this is a raw sparse file.
> 
> 
>> 1.3G is the used size on the file system, we cannot upload only used
>>> blocks.

You have to understand something about the way file systems work.  Just
because the filesystem reports 1.3G used, does NOT mean that 1.3G is
contiguous; depending on how scatter-shot its allocation patterns have
been, and the minimum granularity for holes in the host filesystem,
there could very well be gaps that the guest is reporting as unused but
which still contribute towards the in-use block count reported by the host.
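The granularity point above can be seen directly: a hypothetical sketch (assuming an ordinary local filesystem) showing that even a single byte of payload occupies a whole filesystem block on the host, which is one reason host "used" counts exceed what the guest reports as used.

```python
import os
import tempfile

def one_byte_usage():
    """Write one byte and return (file size, bytes actually occupied on disk)."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"x")   # a single byte of payload
        os.fsync(fd)
        st = os.fstat(fd)
        return st.st_size, st.st_blocks * 512
    finally:
        os.close(fd)
        os.unlink(path)

size, on_disk = one_byte_usage()
```

Typically size is 1 while on_disk is the filesystem block size (often 4096), a 4096x difference from the minimum allocation granularity alone.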

Furthermore, when you delete a file, most filesystems do not immediately
rewrite the underlying storage back to 0, but just update metadata in
the filesystem to state that the storage can be reused later for a new
file.  The TRIM operation on filesystems can help reclaim some of that
(now-unused) space, and more and more filesystems are starting to
support the Linux ioctls for punching holes in the middle of existing
files to quit occupying blocks.  But the fact that the host filesystem
claims 8.6G of used blocks is a typical symptom of not having TRIM'ed
the guest filesystem recently, or a case where guest TRIMs do not reach
through all the layers into a host hole-punching operation.

>>> qemu-img info "Disk size" is not the file size but the used size, not
>>> useful for upload.

Disk size of a raw file is how many host blocks are currently in use;
but you are correct that it does not necessarily tell you how much data
the guest filesystem is currently reporting as used - on the other hand,
it is a good conservative upper bound for how much storage you need to
reproduce the same bit pattern on the destination (even if many of the
bits don't need to be reproduced, because the guest filesystem didn't
care about them) - but only insofar as you can preserve sparseness of
the portions of the file that are holes in the source.

You may also be interested in the virt-sparsify tool from libguestfs,
which is very good at squeezing out unneeded space from an image in
order to get a smaller version that still produces the same content that
the guest cares about.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] [Qemu-devel] [PATCH V3] pci: removed the is_express field since a uniform interface was inserted

2017-12-07 Thread Eduardo Habkost
On Tue, Dec 05, 2017 at 07:17:06PM +0200, Yoni Bettan wrote:
> * according to Eduardo Habkost's commit
>   fd3b02c8896d597dd8b9e053dec579cf0386aee1
> 
> * since all PCIEs now implement INTERFACE_PCIE_DEVICE we
>   don't need this field anymore
> 
> * Devices that were only INTERFACE_PCIE_DEVICE (is_express == 1)
>   or
>   devices that were only INTERFACE_CONVENTIONAL_PCI_DEVICE 
> (is_express == 0)
>   were not affected by the change
> 
>   The only devices that were affected are those that are hybrid and 
> also
>   had (is_express == 1) - therefore only:
> - hw/vfio/pci.c
> - hw/usb/hcd-xhci.c
> 
>   For both I made sure that QEMU_PCI_CAP_EXPRESS is on
> 
> Signed-off-by: Yoni Bettan 
> ---
>  docs/pcie_pci_bridge.txt   | 2 +-
>  hw/block/nvme.c| 1 -
>  hw/net/e1000e.c| 1 -
>  hw/pci-bridge/pcie_pci_bridge.c| 1 -
>  hw/pci-bridge/pcie_root_port.c | 1 -
>  hw/pci-bridge/xio3130_downstream.c | 1 -
>  hw/pci-bridge/xio3130_upstream.c   | 1 -
>  hw/pci-host/xilinx-pcie.c  | 1 -
>  hw/pci/pci.c   | 4 +++-
>  hw/scsi/megasas.c  | 4 
>  hw/usb/hcd-xhci.c  | 7 ++-
>  hw/vfio/pci.c  | 3 ++-
>  include/hw/pci/pci.h   | 3 ---
>  13 files changed, 12 insertions(+), 18 deletions(-)
> 
> diff --git a/docs/pcie_pci_bridge.txt b/docs/pcie_pci_bridge.txt
> index 5a4203f97c..ab35ebf3ca 100644
> --- a/docs/pcie_pci_bridge.txt
> +++ b/docs/pcie_pci_bridge.txt
> @@ -110,5 +110,5 @@ To enable device hot-plug into the bridge on Linux 
> there're 3 ways:
>  Implementation
>  ==
>  The PCIE-PCI bridge is based on PCI-PCI bridge, but also accumulates PCI 
> Express
> -features as a PCI Express device (is_express=1).
> +features as a PCI Express device.
>  
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 441e21ed1f..9325bc0911 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -1087,7 +1087,6 @@ static void nvme_class_init(ObjectClass *oc, void *data)
>  pc->vendor_id = PCI_VENDOR_ID_INTEL;
>  pc->device_id = 0x5845;
>  pc->revision = 2;
> -pc->is_express = 1;
>  
>  set_bit(DEVICE_CATEGORY_STORAGE, dc->categories);
>  dc->desc = "Non-Volatile Memory Express";
> diff --git a/hw/net/e1000e.c b/hw/net/e1000e.c
> index f1af279e8d..c360f0d8c9 100644
> --- a/hw/net/e1000e.c
> +++ b/hw/net/e1000e.c
> @@ -675,7 +675,6 @@ static void e1000e_class_init(ObjectClass *class, void 
> *data)
>  c->revision = 0;
>  c->romfile = "efi-e1000e.rom";
>  c->class_id = PCI_CLASS_NETWORK_ETHERNET;
> -c->is_express = 1;
>  
>  dc->desc = "Intel 82574L GbE Controller";
>  dc->reset = e1000e_qdev_reset;
> diff --git a/hw/pci-bridge/pcie_pci_bridge.c b/hw/pci-bridge/pcie_pci_bridge.c
> index a4d827c99d..b7d9ebbec2 100644
> --- a/hw/pci-bridge/pcie_pci_bridge.c
> +++ b/hw/pci-bridge/pcie_pci_bridge.c
> @@ -169,7 +169,6 @@ static void pcie_pci_bridge_class_init(ObjectClass 
> *klass, void *data)
>  DeviceClass *dc = DEVICE_CLASS(klass);
>  HotplugHandlerClass *hc = HOTPLUG_HANDLER_CLASS(klass);
>  
> -k->is_express = 1;
>  k->is_bridge = 1;
>  k->vendor_id = PCI_VENDOR_ID_REDHAT;
>  k->device_id = PCI_DEVICE_ID_REDHAT_PCIE_BRIDGE;
> diff --git a/hw/pci-bridge/pcie_root_port.c b/hw/pci-bridge/pcie_root_port.c
> index 9b6e4ce512..45f9e8cd4a 100644
> --- a/hw/pci-bridge/pcie_root_port.c
> +++ b/hw/pci-bridge/pcie_root_port.c
> @@ -145,7 +145,6 @@ static void rp_class_init(ObjectClass *klass, void *data)
>  DeviceClass *dc = DEVICE_CLASS(klass);
>  PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
>  
> -k->is_express = 1;
>  k->is_bridge = 1;
>  k->config_write = rp_write_config;
>  k->realize = rp_realize;
> diff --git a/hw/pci-bridge/xio3130_downstream.c 
> b/hw/pci-bridge/xio3130_downstream.c
> index 1e09d2afb7..613a0d6bb7 100644
> --- a/hw/pci-bridge/xio3130_downstream.c
> +++ b/hw/pci-bridge/xio3130_downstream.c
> @@ -177,7 +177,6 @@ static void xio3130_downstream_class_init(ObjectClass 
> *klass, void *data)
>  DeviceClass *dc = DEVICE_CLASS(klass);
>  PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
>  
> -k->is_express = 1;
>  k->is_bridge = 1;
>  k->config_write = xio3130_downstream_write_config;
>  k->realize = xio3130_downstream_realize;
> diff --git a/hw/pci-bridge/xio3130_upstream.c 
> b/hw/pci-bridge/xio3130_upstream.c
> index 227997ce46..d4645bddee 100644
> --- a/hw/pci-bridge/xio3130_upstream.c
> +++ b/hw/pci-bridge/xio3130_upstream.c
> @@ -148,7 +148,6 @@ static void xio3130_upstream_class_init(ObjectClass 
> *klass, void *data)
>  DeviceClass *dc = DEVICE_CLASS(klass);
>  PCIDeviceClass *k = PCI_DEVICE_CLASS(klass);
>  
> -k->is_express = 1;
>  k->is_bridge = 1;
>  k->config_write = 

Re: [Qemu-block] [Qemu-devel] [PATCH v2 2/9] block: don't keep AioContext acquired after external_snapshot_prepare()

2017-12-07 Thread Eric Blake
On 12/06/2017 08:45 AM, Stefan Hajnoczi wrote:
> It is not necessary to hold AioContext across transactions anymore since
> bdrv_drained_begin/end() is used to keep the nodes quiesced.  In fact,
> using the AioContext lock for this purpose was always buggy.
> 
> This patch reduces the scope of AioContext locked regions.  This is not
> just a cleanup but also fixes hangs that occur in BDRV_POLL_WHILE()
> because it is unware of recursive locking and does not release the

s/unware/unaware/

> AioContext the necessary number of times to allow progress to be made.
> 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  blockdev.c | 71 
> ++
>  1 file changed, 48 insertions(+), 23 deletions(-)
> 

> @@ -1662,31 +1662,32 @@ static void external_snapshot_prepare(BlkActionState 
> *common,
>  return;
>  }
>  
> -/* Acquire AioContext now so any threads operating on old_bs stop */
> -state->aio_context = bdrv_get_aio_context(state->old_bs);
> -aio_context_acquire(state->aio_context);
> +aio_context = bdrv_get_aio_context(state->old_bs);
> +aio_context_acquire(aio_context);
> +
> +/* Paired with .clean() */
>  bdrv_drained_begin(state->old_bs);
>  
>  if (!bdrv_is_inserted(state->old_bs)) {
>  error_setg(errp, QERR_DEVICE_HAS_NO_MEDIUM, device);
> -return;
> +goto out;
>  }

Using gcc/clang's __attribute__((cleanup)) would make this so much nicer ;)

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





[Qemu-block] [PATCH v6 15/20] vdi: Avoid bitrot of debugging code

2017-12-07 Thread Eric Blake
Rework the debug define so that we always get -Wformat checking,
even when debugging is disabled.

Signed-off-by: Eric Blake 
Reviewed-by: Stefan Weil 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Vladimir Sementsov-Ogievskiy 

---
v2-v5: no change
---
 block/vdi.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/block/vdi.c b/block/vdi.c
index 8da5dfc897..6f83221ddc 100644
--- a/block/vdi.c
+++ b/block/vdi.c
@@ -86,12 +86,18 @@
 #define DEFAULT_CLUSTER_SIZE (1 * MiB)

 #if defined(CONFIG_VDI_DEBUG)
-#define logout(fmt, ...) \
-fprintf(stderr, "vdi\t%-24s" fmt, __func__, ##__VA_ARGS__)
+#define VDI_DEBUG 1
 #else
-#define logout(fmt, ...) ((void)0)
+#define VDI_DEBUG 0
 #endif

+#define logout(fmt, ...) \
+do {\
+if (VDI_DEBUG) {\
+fprintf(stderr, "vdi\t%-24s" fmt, __func__, ##__VA_ARGS__); \
+}   \
+} while (0)
+
 /* Image signature. */
 #define VDI_SIGNATURE 0xbeda107f

-- 
2.14.3




[Qemu-block] [PATCH v6 18/20] vpc: Switch to .bdrv_co_block_status()

2017-12-07 Thread Eric Blake
We are gradually moving away from sector-based interfaces, towards
byte-based.  Update the vpc driver accordingly.  Drop the now-unused
get_sector_offset().

Signed-off-by: Eric Blake 

---
v5: fix incorrect rounding in 'map' and bad loop condition [Vladimir]
v4: rebase to interface tweak
v3: rebase to master
v2: drop get_sector_offset() [Kevin], rebase to mapping flag
---
 block/vpc.c | 43 ++-
 1 file changed, 22 insertions(+), 21 deletions(-)

diff --git a/block/vpc.c b/block/vpc.c
index 1576d7b595..b5b9b4 100644
--- a/block/vpc.c
+++ b/block/vpc.c
@@ -705,53 +705,54 @@ fail:
 return ret;
 }

-static int64_t coroutine_fn vpc_co_get_block_status(BlockDriverState *bs,
-int64_t sector_num, int nb_sectors, int *pnum, BlockDriverState **file)
+static int coroutine_fn vpc_co_block_status(BlockDriverState *bs,
+bool want_zero,
+int64_t offset, int64_t bytes,
+int64_t *pnum, int64_t *map,
+BlockDriverState **file)
 {
 BDRVVPCState *s = bs->opaque;
 VHDFooter *footer = (VHDFooter*) s->footer_buf;
-int64_t start, offset;
+int64_t image_offset;
 bool allocated;
-int64_t ret;
+int ret;
 int n;

 if (be32_to_cpu(footer->type) == VHD_FIXED) {
-*pnum = nb_sectors;
+*pnum = bytes;
+*map = offset;
 *file = bs->file->bs;
-return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID |
-   (sector_num << BDRV_SECTOR_BITS);
+return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID;
 }

 qemu_co_mutex_lock(&s->lock);

-offset = get_image_offset(bs, sector_num << BDRV_SECTOR_BITS, false, NULL);
-start = offset;
-allocated = (offset != -1);
+image_offset = get_image_offset(bs, offset, false, NULL);
+allocated = (image_offset != -1);
 *pnum = 0;
 ret = 0;

 do {
 /* All sectors in a block are contiguous (without using the bitmap) */
-n = ROUND_UP(sector_num + 1, s->block_size / BDRV_SECTOR_SIZE)
-  - sector_num;
-n = MIN(n, nb_sectors);
+n = ROUND_UP(offset + 1, s->block_size) - offset;
+n = MIN(n, bytes);

 *pnum += n;
-sector_num += n;
-nb_sectors -= n;
+offset += n;
+bytes -= n;
 /* *pnum can't be greater than one block for allocated
  * sectors since there is always a bitmap in between. */
 if (allocated) {
 *file = bs->file->bs;
-ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID | start;
+*map = image_offset;
+ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
 break;
 }
-if (nb_sectors == 0) {
+if (bytes == 0) {
 break;
 }
-offset = get_image_offset(bs, sector_num << BDRV_SECTOR_BITS, false,
-  NULL);
-} while (offset == -1);
+image_offset = get_image_offset(bs, offset, false, NULL);
+} while (image_offset == -1);

 qemu_co_mutex_unlock(&s->lock);
 return ret;
@@ -1097,7 +1098,7 @@ static BlockDriver bdrv_vpc = {

 .bdrv_co_preadv = vpc_co_preadv,
 .bdrv_co_pwritev= vpc_co_pwritev,
-.bdrv_co_get_block_status   = vpc_co_get_block_status,
+.bdrv_co_block_status   = vpc_co_block_status,

 .bdrv_get_info  = vpc_get_info,

-- 
2.14.3
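The per-block step in the vpc hunk above, `n = ROUND_UP(offset + 1, s->block_size) - offset`, computes the number of bytes from a byte offset to the end of its containing block. A quick pure-Python restatement (a sketch, not qemu code) of that arithmetic:

```python
def round_up(x, m):
    """Equivalent of qemu's ROUND_UP(x, m) for positive m."""
    return -(-x // m) * m

def bytes_to_block_end(offset, block_size):
    """Bytes remaining in the block containing `offset`, never zero."""
    return round_up(offset + 1, block_size) - offset
```

The `+ 1` is what keeps the result non-zero when `offset` already sits on a block boundary, so the status loop always makes progress.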




Re: [Qemu-block] [Qemu-devel] [PATCH v2 1/9] blockdev: hold AioContext for bdrv_unref() in external_snapshot_clean()

2017-12-07 Thread Eric Blake
On 12/06/2017 08:45 AM, Stefan Hajnoczi wrote:
> bdrv_unref() requires the AioContext lock because bdrv_flush() uses
> BDRV_POLL_WHILE(), which assumes the AioContext is currently held.  If
> BDRV_POLL_WHILE() runs without AioContext held the
> pthread_mutex_unlock() call in aio_context_release() fails.
> 
> This patch moves bdrv_unref() into the AioContext locked region to solve
> the following pthread_mutex_unlock() failure:
> 

> diff --git a/blockdev.c b/blockdev.c
> index 56a6b24a0b..3c8d994ced 100644
> --- a/blockdev.c
> +++ b/blockdev.c
> @@ -1812,8 +1812,8 @@ static void external_snapshot_clean(BlkActionState 
> *common)
>  ExternalSnapshotState *state = DO_UPCAST(ExternalSnapshotState, common, 
> common);
>  if (state->aio_context) {
>  bdrv_drained_end(state->old_bs);
> -aio_context_release(state->aio_context);
>  bdrv_unref(state->new_bs);
> +aio_context_release(state->aio_context);

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





[Qemu-block] [PATCH v6 20/20] block: Drop unused .bdrv_co_get_block_status()

2017-12-07 Thread Eric Blake
We are gradually moving away from sector-based interfaces, towards
byte-based.  Now that all drivers have been updated to provide the
byte-based .bdrv_co_block_status(), we can delete the sector-based
interface.

Signed-off-by: Eric Blake 

---
v6: rebase to changes in patch 1, drop R-b
v5: rebase to master
v4: rebase to interface tweak
v3: no change
v2: rebase to earlier changes
---
 include/block/block_int.h |  3 ---
 block/io.c| 50 ++-
 2 files changed, 10 insertions(+), 43 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 42e5c37a63..6f8df6a847 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -217,9 +217,6 @@ struct BlockDriver {
  * as well as non-NULL pnum, map, and file; in turn, the driver
  * must return an error or set pnum to an aligned non-zero value.
  */
-int64_t coroutine_fn (*bdrv_co_get_block_status)(BlockDriverState *bs,
-int64_t sector_num, int nb_sectors, int *pnum,
-BlockDriverState **file);
 int coroutine_fn (*bdrv_co_block_status)(BlockDriverState *bs,
 bool want_zero, int64_t offset, int64_t bytes, int64_t *pnum,
 int64_t *map, BlockDriverState **file);
diff --git a/block/io.c b/block/io.c
index 109b3826cd..e0376c1d7c 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1893,7 +1893,7 @@ static int coroutine_fn 
bdrv_co_block_status(BlockDriverState *bs,

 /* Must be non-NULL or bdrv_getlength() would have failed */
 assert(bs->drv);
-if (!bs->drv->bdrv_co_get_block_status && !bs->drv->bdrv_co_block_status) {
+if (!bs->drv->bdrv_co_block_status) {
 *pnum = bytes;
 ret = BDRV_BLOCK_DATA | BDRV_BLOCK_ALLOCATED;
 if (offset + bytes == total_size) {
@@ -1911,53 +1911,23 @@ static int coroutine_fn 
bdrv_co_block_status(BlockDriverState *bs,

 /* Round out to request_alignment boundaries */
 align = bs->bl.request_alignment;
-if (bs->drv->bdrv_co_get_block_status && align < BDRV_SECTOR_SIZE) {
-align = BDRV_SECTOR_SIZE;
-}
 aligned_offset = QEMU_ALIGN_DOWN(offset, align);
 aligned_bytes = ROUND_UP(offset + bytes, align) - aligned_offset;

-if (bs->drv->bdrv_co_get_block_status) {
-int count; /* sectors */
-int64_t longret;
-
-assert(QEMU_IS_ALIGNED(aligned_offset | aligned_bytes,
-   BDRV_SECTOR_SIZE));
-/*
- * The contract allows us to return pnum smaller than bytes, even
- * if the next query would see the same status; we truncate the
- * request to avoid overflowing the driver's 32-bit interface.
- */
-longret = bs->drv->bdrv_co_get_block_status(
-bs, aligned_offset >> BDRV_SECTOR_BITS,
-longret = bs->drv->bdrv_co_get_block_status(
-bs, aligned_offset >> BDRV_SECTOR_BITS,
-MIN(INT_MAX, aligned_bytes) >> BDRV_SECTOR_BITS, &count,
-&local_file);
-if (longret < 0) {
-assert(INT_MIN <= longret);
-ret = longret;
-goto out;
-}
-if (longret & BDRV_BLOCK_OFFSET_VALID) {
-local_map = longret & BDRV_BLOCK_OFFSET_MASK;
-}
-ret = longret & ~BDRV_BLOCK_OFFSET_MASK;
-*pnum = count * BDRV_SECTOR_SIZE;
-} else {
-ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
-aligned_bytes, pnum, &local_map,
-&local_file);
-if (ret < 0) {
-*pnum = 0;
-goto out;
-}
-assert(*pnum); /* The block driver must make progress */
+ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
+aligned_bytes, pnum, &local_map,
+&local_file);
+if (ret < 0) {
+*pnum = 0;
+goto out;
 }

 /*
- * The driver's result must be a multiple of request_alignment.
+ * The driver's result must be a non-zero multiple of request_alignment.
  * Clamp pnum and adjust map to original request.
  */
-assert(QEMU_IS_ALIGNED(*pnum, align) && align > offset - aligned_offset);
+assert(*pnum && QEMU_IS_ALIGNED(*pnum, align) &&
+   align > offset - aligned_offset);
 *pnum -= offset - aligned_offset;
 if (*pnum > bytes) {
 *pnum = bytes;
-- 
2.14.3
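The clamp kept at the end of bdrv_co_block_status in the patch above (align down, query the driver, trim back to the caller's range) can be sketched in pure Python (hypothetical, outside qemu, with the driver's answer passed in as `driver_pnum`):

```python
def clamp_to_request(offset, nbytes, align, driver_pnum):
    """Trim a driver answer that starts at the aligned-down offset back to
    the caller's [offset, offset + nbytes) byte range."""
    aligned_offset = (offset // align) * align   # QEMU_ALIGN_DOWN
    # mirrors the assertion: the driver must make non-zero aligned progress
    assert driver_pnum and driver_pnum % align == 0
    pnum = driver_pnum - (offset - aligned_offset)
    return min(pnum, nbytes)
```

Because `driver_pnum` is a non-zero multiple of `align` and `offset - aligned_offset < align`, the trimmed result stays positive, which is exactly what the strengthened assert in the patch guarantees.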




[Qemu-block] [PATCH v6 16/20] vdi: Switch to .bdrv_co_block_status()

2017-12-07 Thread Eric Blake
We are gradually moving away from sector-based interfaces, towards
byte-based.  Update the vdi driver accordingly.  Note that the
TODO is already covered (the block layer guarantees bounds of its
requests), and that we can remove the now-unused s->block_sectors.

Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 

---
v5: fix pnum when offset rounded down to block_size [Vladimir]
v4: rebase to interface tweak
v3: no change
v2: rebase to mapping flag
---
 block/vdi.c | 33 +
 1 file changed, 13 insertions(+), 20 deletions(-)

diff --git a/block/vdi.c b/block/vdi.c
index 6f83221ddc..f309562d36 100644
--- a/block/vdi.c
+++ b/block/vdi.c
@@ -171,8 +171,6 @@ typedef struct {
 uint32_t *bmap;
 /* Size of block (bytes). */
 uint32_t block_size;
-/* Size of block (sectors). */
-uint32_t block_sectors;
 /* First sector of block map. */
 uint32_t bmap_sector;
 /* VDI header (converted to host endianness). */
@@ -462,7 +460,6 @@ static int vdi_open(BlockDriverState *bs, QDict *options, 
int flags,
 bs->total_sectors = header.disk_size / SECTOR_SIZE;

 s->block_size = header.block_size;
-s->block_sectors = header.block_size / SECTOR_SIZE;
 s->bmap_sector = header.offset_bmap / SECTOR_SIZE;
 s->header = header;

@@ -508,33 +505,29 @@ static int vdi_reopen_prepare(BDRVReopenState *state,
 return 0;
 }

-static int64_t coroutine_fn vdi_co_get_block_status(BlockDriverState *bs,
-int64_t sector_num, int nb_sectors, int *pnum, BlockDriverState **file)
+static int coroutine_fn vdi_co_block_status(BlockDriverState *bs,
+bool want_zero,
+int64_t offset, int64_t bytes,
+int64_t *pnum, int64_t *map,
+BlockDriverState **file)
 {
-/* TODO: Check for too large sector_num (in bdrv_is_allocated or here). */
 BDRVVdiState *s = (BDRVVdiState *)bs->opaque;
-size_t bmap_index = sector_num / s->block_sectors;
-size_t sector_in_block = sector_num % s->block_sectors;
-int n_sectors = s->block_sectors - sector_in_block;
+size_t bmap_index = offset / s->block_size;
+size_t index_in_block = offset % s->block_size;
 uint32_t bmap_entry = le32_to_cpu(s->bmap[bmap_index]);
-uint64_t offset;
 int result;

-logout("%p, %" PRId64 ", %d, %p\n", bs, sector_num, nb_sectors, pnum);
-if (n_sectors > nb_sectors) {
-n_sectors = nb_sectors;
-}
-*pnum = n_sectors;
+logout("%p, %" PRId64 ", %" PRId64 ", %p\n", bs, offset, bytes, pnum);
+*pnum = MIN(s->block_size - index_in_block, bytes);
 result = VDI_IS_ALLOCATED(bmap_entry);
 if (!result) {
 return 0;
 }

-offset = s->header.offset_data +
-  (uint64_t)bmap_entry * s->block_size +
-  sector_in_block * SECTOR_SIZE;
+*map = s->header.offset_data + (uint64_t)bmap_entry * s->block_size +
+index_in_block;
 *file = bs->file->bs;
-return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID | offset;
+return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
 }

 static int coroutine_fn
@@ -902,7 +895,7 @@ static BlockDriver bdrv_vdi = {
 .bdrv_child_perm  = bdrv_format_default_perms,
 .bdrv_create = vdi_create,
 .bdrv_has_zero_init = bdrv_has_zero_init_1,
-.bdrv_co_get_block_status = vdi_co_get_block_status,
+.bdrv_co_block_status = vdi_co_block_status,
 .bdrv_make_empty = vdi_make_empty,

 .bdrv_co_preadv = vdi_co_preadv,
-- 
2.14.3
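The byte-based index/offset arithmetic of the updated vdi_co_block_status above can be restated as a small pure-Python sketch (hypothetical, not qemu code; `None` bmap entries stand in for `VDI_IS_ALLOCATED()` being false):

```python
def vdi_block_status(offset, nbytes, block_size, bmap, offset_data):
    """Return (pnum, host map offset or None) for a guest byte offset."""
    bmap_index = offset // block_size
    index_in_block = offset % block_size
    # never cross a block boundary, never exceed the request
    pnum = min(block_size - index_in_block, nbytes)
    entry = bmap[bmap_index]
    if entry is None:                # unallocated block
        return pnum, None
    # host file offset backing this guest offset
    return pnum, offset_data + entry * block_size + index_in_block
```

Note how `pnum` is computed before the allocation check, which is the v5 fix mentioned in the changelog: the count must reflect the offset's position within the block even when the lookup offset was rounded down.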




[Qemu-block] [PATCH v6 14/20] sheepdog: Switch to .bdrv_co_block_status()

2017-12-07 Thread Eric Blake
We are gradually moving away from sector-based interfaces, towards
byte-based.  Update the sheepdog driver accordingly.

Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 

---
v5: no change
v4: update to interface tweak
v3: no change
v2: rebase to mapping flag
---
 block/sheepdog.c | 26 +-
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index 696a71442a..0af8b07892 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -2988,19 +2988,19 @@ static coroutine_fn int sd_co_pdiscard(BlockDriverState 
*bs, int64_t offset,
 return acb.ret;
 }

-static coroutine_fn int64_t
-sd_co_get_block_status(BlockDriverState *bs, int64_t sector_num, int 
nb_sectors,
-   int *pnum, BlockDriverState **file)
+static coroutine_fn int
+sd_co_block_status(BlockDriverState *bs, bool want_zero, int64_t offset,
+   int64_t bytes, int64_t *pnum, int64_t *map,
+   BlockDriverState **file)
 {
 BDRVSheepdogState *s = bs->opaque;
 SheepdogInode *inode = &s->inode;
 uint32_t object_size = (UINT32_C(1) << inode->block_size_shift);
-uint64_t offset = sector_num * BDRV_SECTOR_SIZE;
 unsigned long start = offset / object_size,
-  end = DIV_ROUND_UP((sector_num + nb_sectors) *
- BDRV_SECTOR_SIZE, object_size);
+  end = DIV_ROUND_UP(offset + bytes, object_size);
 unsigned long idx;
-int64_t ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID | offset;
+*map = offset;
+int ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;

 for (idx = start; idx < end; idx++) {
 if (inode->data_vdi_id[idx] == 0) {
@@ -3017,9 +3017,9 @@ sd_co_get_block_status(BlockDriverState *bs, int64_t 
sector_num, int nb_sectors,
 }
 }

-*pnum = (idx - start) * object_size / BDRV_SECTOR_SIZE;
-if (*pnum > nb_sectors) {
-*pnum = nb_sectors;
+*pnum = (idx - start) * object_size;
+if (*pnum > bytes) {
+*pnum = bytes;
 }
 if (ret > 0 && ret & BDRV_BLOCK_OFFSET_VALID) {
 *file = bs;
@@ -3097,7 +3097,7 @@ static BlockDriver bdrv_sheepdog = {
 .bdrv_co_writev = sd_co_writev,
 .bdrv_co_flush_to_disk  = sd_co_flush_to_disk,
 .bdrv_co_pdiscard = sd_co_pdiscard,
-.bdrv_co_get_block_status = sd_co_get_block_status,
+.bdrv_co_block_status   = sd_co_block_status,

 .bdrv_snapshot_create   = sd_snapshot_create,
 .bdrv_snapshot_goto = sd_snapshot_goto,
@@ -3133,7 +3133,7 @@ static BlockDriver bdrv_sheepdog_tcp = {
 .bdrv_co_writev = sd_co_writev,
 .bdrv_co_flush_to_disk  = sd_co_flush_to_disk,
 .bdrv_co_pdiscard = sd_co_pdiscard,
-.bdrv_co_get_block_status = sd_co_get_block_status,
+.bdrv_co_block_status   = sd_co_block_status,

 .bdrv_snapshot_create   = sd_snapshot_create,
 .bdrv_snapshot_goto = sd_snapshot_goto,
@@ -3169,7 +3169,7 @@ static BlockDriver bdrv_sheepdog_unix = {
 .bdrv_co_writev = sd_co_writev,
 .bdrv_co_flush_to_disk  = sd_co_flush_to_disk,
 .bdrv_co_pdiscard = sd_co_pdiscard,
-.bdrv_co_get_block_status = sd_co_get_block_status,
+.bdrv_co_block_status   = sd_co_block_status,

 .bdrv_snapshot_create   = sd_snapshot_create,
 .bdrv_snapshot_goto = sd_snapshot_goto,
-- 
2.14.3




[Qemu-block] [PATCH v6 07/20] iscsi: Switch to .bdrv_co_block_status()

2017-12-07 Thread Eric Blake
We are gradually moving away from sector-based interfaces, towards
byte-based.  Update the iscsi driver accordingly.  In this case,
it is handy to teach iscsi_co_block_status() to handle a NULL map
and file parameter, even though the block layer passes non-NULL
values, because we also call the function directly.  For now, there
are no optimizations done based on the want_zero flag.

We can also make the simplification of asserting that the block
layer passed in aligned values.

Signed-off-by: Eric Blake 

---
v5: assert rather than check for alignment
v4: rebase to interface tweaks
v3: no change
v2: rebase to mapping parameter
---
 block/iscsi.c | 67 ---
 1 file changed, 32 insertions(+), 35 deletions(-)

diff --git a/block/iscsi.c b/block/iscsi.c
index 4896d50d6e..4c35403e27 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -647,28 +647,28 @@ out_unlock:



-static int64_t coroutine_fn iscsi_co_get_block_status(BlockDriverState *bs,
-  int64_t sector_num,
-  int nb_sectors, int *pnum,
-  BlockDriverState **file)
+static int coroutine_fn iscsi_co_block_status(BlockDriverState *bs,
+  bool want_zero, int64_t offset,
+  int64_t bytes, int64_t *pnum,
+  int64_t *map,
+  BlockDriverState **file)
 {
 IscsiLun *iscsilun = bs->opaque;
 struct scsi_get_lba_status *lbas = NULL;
 struct scsi_lba_status_descriptor *lbasd = NULL;
 struct IscsiTask iTask;
-int64_t ret;
+int ret;

 iscsi_co_init_iscsitask(iscsilun, &iTask);

-if (!is_sector_request_lun_aligned(sector_num, nb_sectors, iscsilun)) {
-ret = -EINVAL;
-goto out;
-}
+assert(QEMU_IS_ALIGNED(offset | bytes, iscsilun->block_size));

 /* default to all sectors allocated */
-ret = BDRV_BLOCK_DATA;
-ret |= (sector_num << BDRV_SECTOR_BITS) | BDRV_BLOCK_OFFSET_VALID;
-*pnum = nb_sectors;
+ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
+if (map) {
+*map = offset;
+}
+*pnum = bytes;

 /* LUN does not support logical block provisioning */
 if (!iscsilun->lbpme) {
@@ -678,7 +678,7 @@ static int64_t coroutine_fn 
iscsi_co_get_block_status(BlockDriverState *bs,
 qemu_mutex_lock(&iscsilun->mutex);
 retry:
 if (iscsi_get_lba_status_task(iscsilun->iscsi, iscsilun->lun,
-  sector_qemu2lun(sector_num, iscsilun),
+  offset / iscsilun->block_size,
   8 + 16, iscsi_co_generic_cb,
  &iTask) == NULL) {
 ret = -ENOMEM;
@@ -717,12 +717,12 @@ retry:

 lbasd = &lbas->descriptors[0];

-if (sector_qemu2lun(sector_num, iscsilun) != lbasd->lba) {
+if (offset / iscsilun->block_size != lbasd->lba) {
 ret = -EIO;
 goto out_unlock;
 }

-*pnum = sector_lun2qemu(lbasd->num_blocks, iscsilun);
+*pnum = lbasd->num_blocks * iscsilun->block_size;

 if (lbasd->provisioning == SCSI_PROVISIONING_TYPE_DEALLOCATED ||
 lbasd->provisioning == SCSI_PROVISIONING_TYPE_ANCHORED) {
@@ -733,15 +733,13 @@ retry:
 }

 if (ret & BDRV_BLOCK_ZERO) {
-iscsi_allocmap_set_unallocated(iscsilun, sector_num * BDRV_SECTOR_SIZE,
-   *pnum * BDRV_SECTOR_SIZE);
+iscsi_allocmap_set_unallocated(iscsilun, offset, *pnum);
 } else {
-iscsi_allocmap_set_allocated(iscsilun, sector_num * BDRV_SECTOR_SIZE,
- *pnum * BDRV_SECTOR_SIZE);
+iscsi_allocmap_set_allocated(iscsilun, offset, *pnum);
 }

-if (*pnum > nb_sectors) {
-*pnum = nb_sectors;
+if (*pnum > bytes) {
+*pnum = bytes;
 }
 out_unlock:
 qemu_mutex_unlock(&iscsilun->mutex);
@@ -749,7 +747,7 @@ out:
 if (iTask.task != NULL) {
 scsi_free_scsi_task(iTask.task);
 }
-if (ret > 0 && ret & BDRV_BLOCK_OFFSET_VALID) {
+if (ret > 0 && ret & BDRV_BLOCK_OFFSET_VALID && file) {
 *file = bs;
 }
 return ret;
@@ -788,25 +786,24 @@ static int coroutine_fn iscsi_co_readv(BlockDriverState 
*bs,
  nb_sectors * BDRV_SECTOR_SIZE) &&
 !iscsi_allocmap_is_allocated(iscsilun, sector_num * BDRV_SECTOR_SIZE,
  nb_sectors * BDRV_SECTOR_SIZE)) {
-int pnum;
-BlockDriverState *file;
+int64_t pnum;
 /* check the block status from the beginning of the cluster
  * containing the start sector */
-int cluster_sectors = iscsilun->cluster_size >> BDRV_SECTOR_BITS;
-int head;
-int64_t ret;
+int64_t head;
+int ret;

-

Re: [Qemu-block] import thin provisioned disks with image upload

2017-12-07 Thread Nir Soffer
On Thu, Dec 7, 2017 at 10:15 AM Gianluca Cecchi 
wrote:

> On Thu, Dec 7, 2017 at 1:28 AM, Nir Soffer  wrote:
>
>>
>>
>> On Thu, Dec 7, 2017 at 1:22 AM Gianluca Cecchi 
>> wrote:
>>
>>> On Wed, Dec 6, 2017 at 11:42 PM, Nir Soffer  wrote:
>>>


>
> BTW: I notice that the disk seems preallocated even if original qcow2
> is thin... is this expected?
> This obviously also impacts the time to upload (20Gb virtual disk with
> actual 1.2Gb occupied needs the time equivalent for full 20Gb...)
>

 We upload exactly the file you provided, there is no way we can upload
 20G from a 1.2G file :-)

>>>
>>> But the upload process at a medium rate of 40-50MB/s has lasted about 9
>>> minutes that confirms the 20Gb size
>>> The disk at source has been created as virtio type and format qcow2 from
>>> virt-manager and then only installed a CentOS 7.2 OS with infrastructure
>>> server configuration.
>>> Apart from qemu-img also ls:
>>> # ls -lhs c7lab1.qcow2
>>> 1.3G -rw--- 1 root root 21G Dec  6 23:05 c7lab1.qcow2
>>>
>>
>> The fiile size is 21G - matching what you see. This is the size we upload.
>>
>
> I changed the subject to better address and track the thread..
> You are not convincing me...
>

Trying harder...


> The term "file size" is ambiguous in this context...
>

It is not. file size is what you get from stat:

$ truncate -s 1g empty

$ stat empty
  File: 'empty'
  Size: 1073741824 Blocks: 0  IO Block: 4096   regular file
  ...

$ qemu-img info empty
image: empty
file format: raw
virtual size: 1.0G (1073741824 bytes)
disk size: 0

The value "disk size" used by qemu-img is confusing and not useful
when you want to transfer the file to another host.

I don't know why qemu-img displays this value instead of the actual
file size; adding the qemu-block mailing list in case someone can explain
this.

When you upload or download this file, you will transfer 1g of zeros.
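The gap between stat's "Size" and qemu-img's "disk size" can be reproduced outside qemu-img; a minimal Python sketch, assuming a Linux filesystem with sparse-file support:

```python
import os
import tempfile

# Reproduce `truncate -s 1g empty` and compare the two sizes stat reports.
fd, path = tempfile.mkstemp()
os.close(fd)
os.truncate(path, 1024 ** 3)   # apparent size: 1 GiB, no blocks written

st = os.stat(path)
apparent = st.st_size          # stat's "Size" -- what an HTTP upload transfers
allocated = st.st_blocks * 512 # stat's "Blocks" in bytes -- qemu-img's "disk size"

print(apparent, allocated)     # 1073741824 0
os.remove(path)
```

The upload has to send `apparent` bytes; `allocated` only tells you how much of that is backed by real blocks.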


>
> If I take a disk that was born on this oVirt environment itself I have
> this:
>
> https://drive.google.com/file/d/1Dtb69EKa8adNYiwUDs2d-plSMZIM0Bzr/view?usp=sharing
>
> that shows that is thin provisioned
> Its id shown here:
>
> https://drive.google.com/file/d/1PvLjJVQ3JR4_6v5Da7ATac5ReEObxH-A/view?usp=sharing
>
> If I go and check on my exported file system I get this
>
> [root@ovirt01 NFS_DOMAIN]# find . -name
> "3c68d43f-0f28-4564-b557-d390a125daa6"
>
> ./572eabe7-15d0-42c2-8fa9-0bd773e22e2e/images/3c68d43f-0f28-4564-b557-d390a125daa6
> [root@ovirt01 NFS_DOMAIN]# ls -lsh
> ./572eabe7-15d0-42c2-8fa9-0bd773e22e2e/images/3c68d43f-0f28-4564-b557-d390a125daa6
> total 8.6G
> 8.6G -rw-rw. 1 vdsm kvm  10G Dec  7 08:42
> 09ad8e53-0b22-4fe3-b718-d14352b8290a
> 1.0M -rw-rw. 1 vdsm kvm 1.0M May  1  2016
> 09ad8e53-0b22-4fe3-b718-d14352b8290a.lease
> 4.0K -rw-r--r--. 1 vdsm kvm  317 Jun 21 10:41
> 09ad8e53-0b22-4fe3-b718-d14352b8290a.meta
>
> [root@ovirt01 NFS_DOMAIN]# qemu-img info
> ./572eabe7-15d0-42c2-8fa9-0bd773e22e2e/images/3c68d43f-0f28-4564-b557-d390a125daa6/09ad8e53-0b22-4fe3-b718-d14352b8290a
> image:
> ./572eabe7-15d0-42c2-8fa9-0bd773e22e2e/images/3c68d43f-0f28-4564-b557-d390a125daa6/09ad8e53-0b22-4fe3-b718-d14352b8290a
> file format: raw
> virtual size: 10G (10737418240 bytes)
> disk size: 8.5G
> [root@ovirt01 NFS_DOMAIN]#
>
> So it seems it is raw/sparse
> https://www.ovirt.org/documentation/admin-guide/chap-Virtual_Machine_Disks/
>

Yes, this is a raw sparse file.


> 1.3G is the used size on the file system; we cannot upload only the used
>> blocks.
>> qemu-img info's "Disk size" is not the file size but the used size, not
>> useful
>> for upload.
>>
>
> Why not? You can detect the format of the source and use a strategy
> similar to the one in the bugzilla I'm referring to (see below)
> Or couldn't you use something like
>

That bug is not related. It was about converting the raw sparse format
to qcow2.

In 4.2 we use qemu-img map and the same algorithm used in qemu to
estimate the size required to convert a raw volume to a qcow2 volume.

For transferring images over HTTP, there is no way to avoid sending
unused blocks, except compression.
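To see why compression helps so much here: a hole reads back as zeroes, and any generic compressor collapses such runs almost entirely. A small illustrative sketch (not oVirt code):

```python
import zlib

# 1 MiB of zeroes, as read from a hole in a sparse image.
chunk = b"\0" * (1 << 20)
compressed = zlib.compress(chunk, 1)   # even the fastest level collapses it

# The compressed run is a few KiB at most, so holes cost almost nothing
# on the wire when the transfer is compressed.
print(len(chunk), len(compressed))
```

This is exactly the trade-off in the thread: without compression, every zero byte of the 20G apparent size goes over the wire.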


>
>  rsync -av --sparse source target
>

You can use rsync to transfer images if you like, but we support
only HTTP.


>
> if the target is NFS?
>
>
>> Maybe this file was created with preallocation=full?
>>
>
> In virt-manager there is no such option by default.
> The default format is qcow2/sparse.
> When you create/add a disk you can choose "Select or create custom
> storage":
>
> https://drive.google.com/file/d/1gAwjAG5aRFC5fFgkoZXT5wINvQnov88p/view?usp=sharing
>
> And then choose a format between:
> raw, qcow, qcow2, qed, vmdk, vpc, vdi
>
> In my case it was the default one, so qcow2, as qemu-img correctly detects.
>

I tested creating qcow2 files with virt-manager, and it seems to use
preallocation=metadata by default (behind your back), I guess
for improving performance 

[Qemu-block] [PATCH v6 19/20] vvfat: Switch to .bdrv_co_block_status()

2017-12-07 Thread Eric Blake
We are gradually moving away from sector-based interfaces, towards
byte-based.  Update the vvfat driver accordingly.  Note that we
can rely on the block driver having already clamped limits to our
block size, and simplify accordingly.

Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 

---
v5: no change
v4: rebase to interface tweak
v3: no change
v2: rebase to earlier changes, simplify
---
 block/vvfat.c | 16 +++-
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/block/vvfat.c b/block/vvfat.c
index a690595f2c..37d85a389f 100644
--- a/block/vvfat.c
+++ b/block/vvfat.c
@@ -3086,15 +3086,13 @@ vvfat_co_pwritev(BlockDriverState *bs, uint64_t offset, uint64_t bytes,
 return ret;
 }

-static int64_t coroutine_fn vvfat_co_get_block_status(BlockDriverState *bs,
-int64_t sector_num, int nb_sectors, int *n, BlockDriverState **file)
+static int coroutine_fn vvfat_co_block_status(BlockDriverState *bs,
+  bool want_zero, int64_t offset,
+  int64_t bytes, int64_t *n,
+  int64_t *map,
+  BlockDriverState **file)
 {
-*n = bs->total_sectors - sector_num;
-if (*n > nb_sectors) {
-*n = nb_sectors;
-} else if (*n < 0) {
-return 0;
-}
+*n = bytes;
 return BDRV_BLOCK_DATA;
 }

@@ -3255,7 +3253,7 @@ static BlockDriver bdrv_vvfat = {

 .bdrv_co_preadv = vvfat_co_preadv,
 .bdrv_co_pwritev= vvfat_co_pwritev,
-.bdrv_co_get_block_status = vvfat_co_get_block_status,
+.bdrv_co_block_status   = vvfat_co_block_status,
 };

 static void bdrv_vvfat_init(void)
-- 
2.14.3




[Qemu-block] [PATCH v6 17/20] vmdk: Switch to .bdrv_co_block_status()

2017-12-07 Thread Eric Blake
We are gradually moving away from sector-based interfaces, towards
byte-based.  Update the vmdk driver accordingly.  Drop the
now-unused vmdk_find_index_in_cluster().

Also, fix a pre-existing bug: if find_extent() fails (unlikely,
since the block layer did a bounds check), then we must return a
failure, rather than 0.

Signed-off-by: Eric Blake 

---
v5: drop dead code [Vladimir], return error on find_extent() failure
v4: rebase to interface tweak
v3: no change
v2: rebase to mapping flag
---
 block/vmdk.c | 38 ++
 1 file changed, 14 insertions(+), 24 deletions(-)

diff --git a/block/vmdk.c b/block/vmdk.c
index c665bcc977..8ae36e7a45 100644
--- a/block/vmdk.c
+++ b/block/vmdk.c
@@ -1295,33 +1295,27 @@ static inline uint64_t vmdk_find_offset_in_cluster(VmdkExtent *extent,
 return extent_relative_offset % cluster_size;
 }

-static inline uint64_t vmdk_find_index_in_cluster(VmdkExtent *extent,
-  int64_t sector_num)
-{
-uint64_t offset;
-offset = vmdk_find_offset_in_cluster(extent, sector_num * BDRV_SECTOR_SIZE);
-return offset / BDRV_SECTOR_SIZE;
-}
-
-static int64_t coroutine_fn vmdk_co_get_block_status(BlockDriverState *bs,
-int64_t sector_num, int nb_sectors, int *pnum, BlockDriverState **file)
+static int coroutine_fn vmdk_co_block_status(BlockDriverState *bs,
+ bool want_zero,
+ int64_t offset, int64_t bytes,
+ int64_t *pnum, int64_t *map,
+ BlockDriverState **file)
 {
 BDRVVmdkState *s = bs->opaque;
 int64_t index_in_cluster, n, ret;
-uint64_t offset;
+uint64_t cluster_offset;
 VmdkExtent *extent;

-extent = find_extent(s, sector_num, NULL);
+extent = find_extent(s, offset >> BDRV_SECTOR_BITS, NULL);
 if (!extent) {
-return 0;
+return -EIO;
 }
qemu_co_mutex_lock(&s->lock);
-ret = get_cluster_offset(bs, extent, NULL,
- sector_num * 512, false, &offset,
+ret = get_cluster_offset(bs, extent, NULL, offset, false, &cluster_offset,
  0, 0);
qemu_co_mutex_unlock(&s->lock);

-index_in_cluster = vmdk_find_index_in_cluster(extent, sector_num);
+index_in_cluster = vmdk_find_offset_in_cluster(extent, offset);
 switch (ret) {
 case VMDK_ERROR:
 ret = -EIO;
@@ -1336,18 +1330,14 @@ static int64_t coroutine_fn vmdk_co_get_block_status(BlockDriverState *bs,
 ret = BDRV_BLOCK_DATA;
 if (!extent->compressed) {
 ret |= BDRV_BLOCK_OFFSET_VALID;
-ret |= (offset + (index_in_cluster << BDRV_SECTOR_BITS))
-& BDRV_BLOCK_OFFSET_MASK;
+*map = cluster_offset + index_in_cluster;
 }
 *file = extent->file->bs;
 break;
 }

-n = extent->cluster_sectors - index_in_cluster;
-if (n > nb_sectors) {
-n = nb_sectors;
-}
-*pnum = n;
+n = extent->cluster_sectors * BDRV_SECTOR_SIZE - index_in_cluster;
+*pnum = MIN(n, bytes);
 return ret;
 }

@@ -2393,7 +2383,7 @@ static BlockDriver bdrv_vmdk = {
 .bdrv_close   = vmdk_close,
 .bdrv_create  = vmdk_create,
 .bdrv_co_flush_to_disk= vmdk_co_flush,
-.bdrv_co_get_block_status = vmdk_co_get_block_status,
+.bdrv_co_block_status = vmdk_co_block_status,
 .bdrv_get_allocated_file_size = vmdk_get_allocated_file_size,
 .bdrv_has_zero_init   = vmdk_has_zero_init,
 .bdrv_get_specific_info   = vmdk_get_specific_info,
-- 
2.14.3




[Qemu-block] [PATCH v6 10/20] qcow: Switch to .bdrv_co_block_status()

2017-12-07 Thread Eric Blake
We are gradually moving away from sector-based interfaces, towards
byte-based.  Update the qcow driver accordingly.  There is no
intent to optimize based on the want_zero flag for this format.
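The core of the byte-based conversion in the hunk below is the cluster clamp: `*pnum` is the distance from the offset to the end of its cluster, capped at the request size. A Python model of that arithmetic (illustrative only, not QEMU code; assumes `cluster_size` is a power of two, as it is for qcow images):

```python
def clamp_to_cluster(offset: int, nbytes: int, cluster_size: int) -> int:
    """Bytes from 'offset' to the end of its cluster, capped at 'nbytes'.

    Mirrors: index_in_cluster = offset & (cluster_size - 1);
             n = cluster_size - index_in_cluster; *pnum = MIN(n, bytes);
    """
    index_in_cluster = offset & (cluster_size - 1)
    n = cluster_size - index_in_cluster
    return min(n, nbytes)

# A request starting 512 bytes into a 64 KiB cluster can cover at most the
# remaining 65024 bytes of that cluster, however large the request is.
print(clamp_to_cluster(512, 1 << 20, 64 * 1024))   # 65024
```

The sector-based version did the same thing in units of 512-byte sectors; only the scale changes.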

Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 

---
v5: no change
v4: rebase to interface tweak
v3: rebase to master
v2: rebase to mapping flag
---
 block/qcow.c | 27 ---
 1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/block/qcow.c b/block/qcow.c
index 9569deeaf0..f09661dc0c 100644
--- a/block/qcow.c
+++ b/block/qcow.c
@@ -513,23 +513,28 @@ static int get_cluster_offset(BlockDriverState *bs,
 return 1;
 }

-static int64_t coroutine_fn qcow_co_get_block_status(BlockDriverState *bs,
-int64_t sector_num, int nb_sectors, int *pnum, BlockDriverState **file)
+static int coroutine_fn qcow_co_block_status(BlockDriverState *bs,
+ bool want_zero,
+ int64_t offset, int64_t bytes,
+ int64_t *pnum, int64_t *map,
+ BlockDriverState **file)
 {
 BDRVQcowState *s = bs->opaque;
-int index_in_cluster, n, ret;
+int index_in_cluster, ret;
+int64_t n;
 uint64_t cluster_offset;

qemu_co_mutex_lock(&s->lock);
-ret = get_cluster_offset(bs, sector_num << 9, 0, 0, 0, 0, &cluster_offset);
+ret = get_cluster_offset(bs, offset, 0, 0, 0, 0, &cluster_offset);
qemu_co_mutex_unlock(&s->lock);
 if (ret < 0) {
 return ret;
 }
-index_in_cluster = sector_num & (s->cluster_sectors - 1);
-n = s->cluster_sectors - index_in_cluster;
-if (n > nb_sectors)
-n = nb_sectors;
+index_in_cluster = offset & (s->cluster_size - 1);
+n = s->cluster_size - index_in_cluster;
+if (n > bytes) {
+n = bytes;
+}
 *pnum = n;
 if (!cluster_offset) {
 return 0;
@@ -537,9 +542,9 @@ static int64_t coroutine_fn qcow_co_get_block_status(BlockDriverState *bs,
 if ((cluster_offset & QCOW_OFLAG_COMPRESSED) || s->crypto) {
 return BDRV_BLOCK_DATA;
 }
-cluster_offset |= (index_in_cluster << BDRV_SECTOR_BITS);
+*map = cluster_offset | index_in_cluster;
 *file = bs->file->bs;
-return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID | cluster_offset;
+return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
 }

 static int decompress_buffer(uint8_t *out_buf, int out_buf_size,
@@ -,7 +1116,7 @@ static BlockDriver bdrv_qcow = {

 .bdrv_co_readv  = qcow_co_readv,
 .bdrv_co_writev = qcow_co_writev,
-.bdrv_co_get_block_status   = qcow_co_get_block_status,
+.bdrv_co_block_status   = qcow_co_block_status,

 .bdrv_make_empty= qcow_make_empty,
 .bdrv_co_pwritev_compressed = qcow_co_pwritev_compressed,
-- 
2.14.3




[Qemu-block] [PATCH v6 11/20] qcow2: Switch to .bdrv_co_block_status()

2017-12-07 Thread Eric Blake
We are gradually moving away from sector-based interfaces, towards
byte-based.  Update the qcow2 driver accordingly.

For now, we are ignoring the 'want_zero' hint.  However, it should
be relatively straightforward to honor the hint as a way to return
larger *pnum values when we have consecutive clusters with the same
data/zero status but which differ only in having non-consecutive
mappings.
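The optimization hinted at above can be modeled simply: when the caller passes want_zero=false it only cares about data/zero status, so consecutive clusters with the same status can be coalesced into one large *pnum even when their host offsets are discontiguous. An illustrative Python sketch (hypothetical model, not qcow2 code):

```python
def merge_runs(clusters):
    """Coalesce consecutive clusters whose data/zero status matches.

    Each entry is (status, host_offset).  The host offset is deliberately
    ignored, the way a want_zero=False caller would ignore the mapping.
    """
    runs = []
    for status, _ in clusters:
        if runs and runs[-1][0] == status:
            runs[-1][1] += 1       # extend the current run
        else:
            runs.append([status, 1])
    return [tuple(r) for r in runs]

# Two data clusters with discontiguous host offsets still form one run:
print(merge_runs([("DATA", 0), ("DATA", 1 << 20), ("ZERO", None)]))
# [('DATA', 2), ('ZERO', 1)]
```

With mapping detail requested, the first two clusters would have to be reported separately because their host offsets are not consecutive.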

Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 

---
v5: no change
v4: update to interface tweak
v3: no change
v2: rebase to mapping flag
---
 block/qcow2.c | 24 +---
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 1914a940e5..642fa3bc8f 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1637,32 +1637,34 @@ static void qcow2_join_options(QDict *options, QDict *old_options)
 }
 }

-static int64_t coroutine_fn qcow2_co_get_block_status(BlockDriverState *bs,
-int64_t sector_num, int nb_sectors, int *pnum, BlockDriverState **file)
+static int coroutine_fn qcow2_co_block_status(BlockDriverState *bs,
+  bool want_zero,
+  int64_t offset, int64_t count,
+  int64_t *pnum, int64_t *map,
+  BlockDriverState **file)
 {
 BDRVQcow2State *s = bs->opaque;
 uint64_t cluster_offset;
 int index_in_cluster, ret;
 unsigned int bytes;
-int64_t status = 0;
+int status = 0;

-bytes = MIN(INT_MAX, nb_sectors * BDRV_SECTOR_SIZE);
+bytes = MIN(INT_MAX, count);
qemu_co_mutex_lock(&s->lock);
-ret = qcow2_get_cluster_offset(bs, sector_num << BDRV_SECTOR_BITS, &bytes,
-   &cluster_offset);
+ret = qcow2_get_cluster_offset(bs, offset, &bytes, &cluster_offset);
qemu_co_mutex_unlock(&s->lock);
 if (ret < 0) {
 return ret;
 }

-*pnum = bytes >> BDRV_SECTOR_BITS;
+*pnum = bytes;

 if (cluster_offset != 0 && ret != QCOW2_CLUSTER_COMPRESSED &&
 !s->crypto) {
-index_in_cluster = sector_num & (s->cluster_sectors - 1);
-cluster_offset |= (index_in_cluster << BDRV_SECTOR_BITS);
+index_in_cluster = offset & (s->cluster_size - 1);
+*map = cluster_offset | index_in_cluster;
 *file = bs->file->bs;
-status |= BDRV_BLOCK_OFFSET_VALID | cluster_offset;
+status |= BDRV_BLOCK_OFFSET_VALID;
 }
 if (ret == QCOW2_CLUSTER_ZERO_PLAIN || ret == QCOW2_CLUSTER_ZERO_ALLOC) {
 status |= BDRV_BLOCK_ZERO;
@@ -4355,7 +4357,7 @@ BlockDriver bdrv_qcow2 = {
 .bdrv_child_perm  = bdrv_format_default_perms,
 .bdrv_create= qcow2_create,
 .bdrv_has_zero_init = bdrv_has_zero_init_1,
-.bdrv_co_get_block_status = qcow2_co_get_block_status,
+.bdrv_co_block_status = qcow2_co_block_status,

 .bdrv_co_preadv = qcow2_co_preadv,
 .bdrv_co_pwritev= qcow2_co_pwritev,
-- 
2.14.3




[Qemu-block] [PATCH v6 12/20] qed: Switch to .bdrv_co_block_status()

2017-12-07 Thread Eric Blake
We are gradually moving away from sector-based interfaces, towards
byte-based.  Update the qed driver accordingly, taking the opportunity
to inline qed_is_allocated_cb() into its lone caller (the callback
used to be important, until we switched qed to coroutines).  There is
no intent to optimize based on the want_zero flag for this format.

Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 

---
v5: initialize len before qed_find_cluster() [Vladimir]
v4: rebase to interface change, inline pointless callback
v3: no change
v2: rebase to mapping flag, fix mask in qed_is_allocated_cb
---
 block/qed.c | 84 +
 1 file changed, 28 insertions(+), 56 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index 821dcaa055..13fbc3e2f1 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -688,74 +688,46 @@ finish:
 return ret;
 }

-typedef struct {
-BlockDriverState *bs;
-Coroutine *co;
-uint64_t pos;
-int64_t status;
-int *pnum;
-BlockDriverState **file;
-} QEDIsAllocatedCB;
-
-/* Called with table_lock held.  */
-static void qed_is_allocated_cb(void *opaque, int ret, uint64_t offset, size_t len)
-{
-QEDIsAllocatedCB *cb = opaque;
-BDRVQEDState *s = cb->bs->opaque;
-*cb->pnum = len / BDRV_SECTOR_SIZE;
-switch (ret) {
-case QED_CLUSTER_FOUND:
-offset |= qed_offset_into_cluster(s, cb->pos);
-cb->status = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID | offset;
-*cb->file = cb->bs->file->bs;
-break;
-case QED_CLUSTER_ZERO:
-cb->status = BDRV_BLOCK_ZERO;
-break;
-case QED_CLUSTER_L2:
-case QED_CLUSTER_L1:
-cb->status = 0;
-break;
-default:
-assert(ret < 0);
-cb->status = ret;
-break;
-}
-
-if (cb->co) {
-aio_co_wake(cb->co);
-}
-}
-
-static int64_t coroutine_fn bdrv_qed_co_get_block_status(BlockDriverState *bs,
- int64_t sector_num,
- int nb_sectors, int *pnum,
+static int coroutine_fn bdrv_qed_co_block_status(BlockDriverState *bs,
+ bool want_zero,
+ int64_t pos, int64_t bytes,
+ int64_t *pnum, int64_t *map,
  BlockDriverState **file)
 {
 BDRVQEDState *s = bs->opaque;
-size_t len = (size_t)nb_sectors * BDRV_SECTOR_SIZE;
-QEDIsAllocatedCB cb = {
-.bs = bs,
-.pos = (uint64_t)sector_num * BDRV_SECTOR_SIZE,
-.status = BDRV_BLOCK_OFFSET_MASK,
-.pnum = pnum,
-.file = file,
-};
+size_t len = MIN(bytes, SIZE_MAX);
+int status;
 QEDRequest request = { .l2_table = NULL };
 uint64_t offset;
 int ret;

qemu_co_mutex_lock(&s->table_lock);
-ret = qed_find_cluster(s, &request, cb.pos, &offset, &len);
-qed_is_allocated_cb(&cb, ret, offset, len);
+ret = qed_find_cluster(s, &request, pos, &offset, &len);

-/* The callback was invoked immediately */
-assert(cb.status != BDRV_BLOCK_OFFSET_MASK);
+*pnum = len;
+switch (ret) {
+case QED_CLUSTER_FOUND:
+*map = offset | qed_offset_into_cluster(s, pos);
+status = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
+*file = bs->file->bs;
+break;
+case QED_CLUSTER_ZERO:
+status = BDRV_BLOCK_ZERO;
+break;
+case QED_CLUSTER_L2:
+case QED_CLUSTER_L1:
+status = 0;
+break;
+default:
+assert(ret < 0);
+status = ret;
+break;
+}

 qed_unref_l2_cache_entry(request.l2_table);
qemu_co_mutex_unlock(&s->table_lock);

-return cb.status;
+return status;
 }

 static BDRVQEDState *acb_to_s(QEDAIOCB *acb)
@@ -1595,7 +1567,7 @@ static BlockDriver bdrv_qed = {
 .bdrv_child_perm  = bdrv_format_default_perms,
 .bdrv_create  = bdrv_qed_create,
 .bdrv_has_zero_init   = bdrv_has_zero_init_1,
-.bdrv_co_get_block_status = bdrv_qed_co_get_block_status,
+.bdrv_co_block_status = bdrv_qed_co_block_status,
 .bdrv_co_readv= bdrv_qed_co_readv,
 .bdrv_co_writev   = bdrv_qed_co_writev,
 .bdrv_co_pwrite_zeroes= bdrv_qed_co_pwrite_zeroes,
-- 
2.14.3




[Qemu-block] [PATCH v6 13/20] raw: Switch to .bdrv_co_block_status()

2017-12-07 Thread Eric Blake
We are gradually moving away from sector-based interfaces, towards
byte-based.  Update the raw driver accordingly.

Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 

---
v5: no change
v4: rebase to interface tweak
v3: no change
v2: rebase to mapping
---
 block/raw-format.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/block/raw-format.c b/block/raw-format.c
index ab552c0954..830243a8e4 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -250,17 +250,17 @@ fail:
 return ret;
 }

-static int64_t coroutine_fn raw_co_get_block_status(BlockDriverState *bs,
-int64_t sector_num,
-int nb_sectors, int *pnum,
+static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
+bool want_zero, int64_t offset,
+int64_t bytes, int64_t *pnum,
+int64_t *map,
 BlockDriverState **file)
 {
 BDRVRawState *s = bs->opaque;
-*pnum = nb_sectors;
+*pnum = bytes;
 *file = bs->file->bs;
-sector_num += s->offset / BDRV_SECTOR_SIZE;
-return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID |
-   (sector_num << BDRV_SECTOR_BITS);
+*map = offset + s->offset;
+return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID;
 }

 static int coroutine_fn raw_co_pwrite_zeroes(BlockDriverState *bs,
@@ -496,7 +496,7 @@ BlockDriver bdrv_raw = {
.bdrv_co_pwritev  = &raw_co_pwritev,
.bdrv_co_pwrite_zeroes = &raw_co_pwrite_zeroes,
.bdrv_co_pdiscard = &raw_co_pdiscard,
-.bdrv_co_get_block_status = &raw_co_get_block_status,
+.bdrv_co_block_status = &raw_co_block_status,
.bdrv_truncate= &raw_truncate,
.bdrv_getlength   = &raw_getlength,
 .has_variable_length  = true,
-- 
2.14.3




[Qemu-block] [PATCH v6 09/20] parallels: Switch to .bdrv_co_block_status()

2017-12-07 Thread Eric Blake
We are gradually moving away from sector-based interfaces, towards
byte-based.  Update the parallels driver accordingly.  Note that
the internal function block_status() is still sector-based, because
it is still in use by other sector-based functions; but that's okay
because request_alignment is 512 as a result of those functions.
For now, no optimizations are added based on the mapping hint.
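The scaling described above is mechanical: the byte-based callback shifts down to sectors before calling the internal helper, and scales the helper's sector count back up for *pnum. A Python model of that glue (illustrative only, not the parallels driver):

```python
BDRV_SECTOR_BITS = 9
BDRV_SECTOR_SIZE = 1 << BDRV_SECTOR_BITS

def bytes_to_sectors(offset: int, nbytes: int):
    """Scale a byte request down to a sector-based internal helper.

    The alignment assert holds because the driver's request_alignment
    stays at 512, so the block layer never sends unaligned requests.
    """
    assert (offset | nbytes) % BDRV_SECTOR_SIZE == 0
    return offset >> BDRV_SECTOR_BITS, nbytes >> BDRV_SECTOR_BITS

def sectors_to_bytes(count: int) -> int:
    # Scale the helper's sector count back up for the byte-based *pnum.
    return count * BDRV_SECTOR_SIZE

sec_off, sec_cnt = bytes_to_sectors(1 << 20, 64 * 1024)
print(sec_off, sec_cnt, sectors_to_bytes(sec_cnt))   # 2048 128 65536
```

This is why keeping block_status() sector-based internally is safe for now: the conversion at the boundary is lossless under the 512-byte alignment guarantee.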

Signed-off-by: Eric Blake 

---
v5: fix pnum when return is 0
v4: rebase to interface tweak, R-b dropped
v3: no change
v2: rebase to mapping parameter; it is ignored, so R-b kept
---
 block/parallels.c | 22 +++---
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/block/parallels.c b/block/parallels.c
index 9545761f49..9a3d3748ae 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -280,23 +280,31 @@ static coroutine_fn int parallels_co_flush_to_os(BlockDriverState *bs)
 }


-static int64_t coroutine_fn parallels_co_get_block_status(BlockDriverState *bs,
-int64_t sector_num, int nb_sectors, int *pnum, BlockDriverState **file)
+static int coroutine_fn parallels_co_block_status(BlockDriverState *bs,
+  bool want_zero,
+  int64_t offset,
+  int64_t bytes,
+  int64_t *pnum,
+  int64_t *map,
+  BlockDriverState **file)
 {
 BDRVParallelsState *s = bs->opaque;
-int64_t offset;
+int count;

+assert(QEMU_IS_ALIGNED(offset | bytes, BDRV_SECTOR_SIZE));
 qemu_co_mutex_lock(>lock);
-offset = block_status(s, sector_num, nb_sectors, pnum);
+offset = block_status(s, offset >> BDRV_SECTOR_BITS,
+  bytes >> BDRV_SECTOR_BITS, &count);
 qemu_co_mutex_unlock(>lock);

+*pnum = count * BDRV_SECTOR_SIZE;
 if (offset < 0) {
 return 0;
 }

+*map = offset;
 *file = bs->file->bs;
-return (offset << BDRV_SECTOR_BITS) |
-BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
+return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
 }

 static coroutine_fn int parallels_co_writev(BlockDriverState *bs,
@@ -793,7 +801,7 @@ static BlockDriver bdrv_parallels = {
 .bdrv_open = parallels_open,
 .bdrv_close= parallels_close,
 .bdrv_child_perm  = bdrv_format_default_perms,
-.bdrv_co_get_block_status = parallels_co_get_block_status,
+.bdrv_co_block_status = parallels_co_block_status,
 .bdrv_has_zero_init   = bdrv_has_zero_init_1,
 .bdrv_co_flush_to_os  = parallels_co_flush_to_os,
 .bdrv_co_readv  = parallels_co_readv,
-- 
2.14.3




[Qemu-block] [PATCH v6 08/20] null: Switch to .bdrv_co_block_status()

2017-12-07 Thread Eric Blake
We are gradually moving away from sector-based interfaces, towards
byte-based.  Update the null driver accordingly.

Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 

---
v5: minor fix to type of 'ret'
v4: rebase to interface tweak
v3: no change
v2: rebase to mapping parameter
---
 block/null.c | 23 ---
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/block/null.c b/block/null.c
index dd9c13f9ba..b762246860 100644
--- a/block/null.c
+++ b/block/null.c
@@ -223,22 +223,23 @@ static int null_reopen_prepare(BDRVReopenState *reopen_state,
 return 0;
 }

-static int64_t coroutine_fn null_co_get_block_status(BlockDriverState *bs,
- int64_t sector_num,
- int nb_sectors, int *pnum,
- BlockDriverState **file)
+static int coroutine_fn null_co_block_status(BlockDriverState *bs,
+ bool want_zero, int64_t offset,
+ int64_t bytes, int64_t *pnum,
+ int64_t *map,
+ BlockDriverState **file)
 {
 BDRVNullState *s = bs->opaque;
-off_t start = sector_num * BDRV_SECTOR_SIZE;
+int ret = BDRV_BLOCK_OFFSET_VALID;

-*pnum = nb_sectors;
+*pnum = bytes;
+*map = offset;
 *file = bs;

 if (s->read_zeroes) {
-return BDRV_BLOCK_OFFSET_VALID | start | BDRV_BLOCK_ZERO;
-} else {
-return BDRV_BLOCK_OFFSET_VALID | start;
+ret |= BDRV_BLOCK_ZERO;
 }
+return ret;
 }

 static void null_refresh_filename(BlockDriverState *bs, QDict *opts)
@@ -270,7 +271,7 @@ static BlockDriver bdrv_null_co = {
 .bdrv_co_flush_to_disk  = null_co_flush,
 .bdrv_reopen_prepare= null_reopen_prepare,

-.bdrv_co_get_block_status   = null_co_get_block_status,
+.bdrv_co_block_status   = null_co_block_status,

 .bdrv_refresh_filename  = null_refresh_filename,
 };
@@ -290,7 +291,7 @@ static BlockDriver bdrv_null_aio = {
 .bdrv_aio_flush = null_aio_flush,
 .bdrv_reopen_prepare= null_reopen_prepare,

-.bdrv_co_get_block_status   = null_co_get_block_status,
+.bdrv_co_block_status   = null_co_block_status,

 .bdrv_refresh_filename  = null_refresh_filename,
 };
-- 
2.14.3




[Qemu-block] [PATCH v6 04/20] gluster: Switch to .bdrv_co_block_status()

2017-12-07 Thread Eric Blake
We are gradually moving away from sector-based interfaces, towards
byte-based.  Update the gluster driver accordingly.

In want_zero mode, we continue to report fine-grained hole
information (the caller wants as much mapping detail as possible);
but when not in that mode, the caller prefers larger *pnum and
merely cares about what offsets are allocated at this layer, rather
than where the holes live.  Since holes still read as zeroes at
this layer (rather than deferring to a backing layer), we can take
the shortcut of skipping find_allocation(), and merely state that
all bytes are allocated.

We can also drop redundant bounds checks that are already
guaranteed by the block layer.
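The want_zero shortcut described above can be sketched in a few lines. This is a hypothetical Python model, not the gluster driver; `find_allocation` stands in for the SEEK_DATA/SEEK_HOLE probe:

```python
BDRV_BLOCK_DATA, BDRV_BLOCK_ZERO = 0x1, 0x2

def block_status(offset, nbytes, want_zero, find_allocation):
    """Return (*pnum, status) for a request at 'offset'.

    Without want_zero the probe is skipped entirely and the whole range
    is reported as allocated data, since holes still read as zeroes at
    this layer anyway.
    """
    if not want_zero:
        return nbytes, BDRV_BLOCK_DATA
    data, hole = find_allocation(offset)   # (next data start, next hole start)
    if data == offset:                     # sitting on a data extent
        return min(nbytes, hole - offset), BDRV_BLOCK_DATA
    return min(nbytes, data - offset), BDRV_BLOCK_ZERO   # sitting on a hole

# One data extent [0, 4096) followed by a hole up to 1 GiB:
probe = lambda off: (0, 4096) if off < 4096 else (1 << 30, 1 << 30)
print(block_status(0, 10000, True, probe))    # (4096, 1): stop at the hole
print(block_status(0, 10000, False, probe))   # (10000, 1): no probing at all
```

The same shortcut appears in the file-posix conversion later in this series, where the skipped probe is an lseek() call.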

Signed-off-by: Eric Blake 

---
v5: drop redundant code
v4: tweak commit message [Fam], rebase to interface tweak
v3: no change
v2: tweak comments [Prasanna], add mapping, drop R-b
---
 block/gluster.c | 70 -
 1 file changed, 34 insertions(+), 36 deletions(-)

diff --git a/block/gluster.c b/block/gluster.c
index 0f4265a3a4..6388ced82e 100644
--- a/block/gluster.c
+++ b/block/gluster.c
@@ -1349,68 +1349,66 @@ exit:
 }

 /*
- * Returns the allocation status of the specified sectors.
+ * Returns the allocation status of the specified offset.
  *
- * If 'sector_num' is beyond the end of the disk image the return value is 0
- * and 'pnum' is set to 0.
+ * The block layer guarantees 'offset' and 'bytes' are within bounds.
  *
- * 'pnum' is set to the number of sectors (including and immediately following
- * the specified sector) that are known to be in the same
+ * 'pnum' is set to the number of bytes (including and immediately following
+ * the specified offset) that are known to be in the same
  * allocated/unallocated state.
  *
- * 'nb_sectors' is the max value 'pnum' should be set to.  If nb_sectors goes
- * beyond the end of the disk image it will be clamped.
+ * 'bytes' is the max value 'pnum' should be set to.
  *
- * (Based on raw_co_get_block_status() from file-posix.c.)
+ * (Based on raw_co_block_status() from file-posix.c.)
  */
-static int64_t coroutine_fn qemu_gluster_co_get_block_status(
-BlockDriverState *bs, int64_t sector_num, int nb_sectors, int *pnum,
-BlockDriverState **file)
+static int coroutine_fn qemu_gluster_co_block_status(BlockDriverState *bs,
+ bool want_zero,
+ int64_t offset,
+ int64_t bytes,
+ int64_t *pnum,
+ int64_t *map,
+ BlockDriverState **file)
 {
 BDRVGlusterState *s = bs->opaque;
-off_t start, data = 0, hole = 0;
-int64_t total_size;
+off_t data = 0, hole = 0;
 int ret = -EINVAL;

 if (!s->fd) {
 return ret;
 }

-start = sector_num * BDRV_SECTOR_SIZE;
-total_size = bdrv_getlength(bs);
-if (total_size < 0) {
-return total_size;
-} else if (start >= total_size) {
-*pnum = 0;
-return 0;
-} else if (start + nb_sectors * BDRV_SECTOR_SIZE > total_size) {
-nb_sectors = DIV_ROUND_UP(total_size - start, BDRV_SECTOR_SIZE);
+if (!want_zero) {
+*pnum = bytes;
+*map = offset;
+*file = bs;
+return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
 }

-ret = find_allocation(bs, start, &data, &hole);
+ret = find_allocation(bs, offset, &data, &hole);
 if (ret == -ENXIO) {
 /* Trailing hole */
-*pnum = nb_sectors;
+*pnum = bytes;
 ret = BDRV_BLOCK_ZERO;
 } else if (ret < 0) {
 /* No info available, so pretend there are no holes */
-*pnum = nb_sectors;
+*pnum = bytes;
 ret = BDRV_BLOCK_DATA;
-} else if (data == start) {
-/* On a data extent, compute sectors to the end of the extent,
+} else if (data == offset) {
+/* On a data extent, compute bytes to the end of the extent,
  * possibly including a partial sector at EOF. */
-*pnum = MIN(nb_sectors, DIV_ROUND_UP(hole - start, BDRV_SECTOR_SIZE));
+*pnum = MIN(bytes, hole - offset);
 ret = BDRV_BLOCK_DATA;
 } else {
-/* On a hole, compute sectors to the beginning of the next extent.  */
-assert(hole == start);
-*pnum = MIN(nb_sectors, (data - start) / BDRV_SECTOR_SIZE);
+/* On a hole, compute bytes to the beginning of the next extent.  */
+assert(hole == offset);
+*pnum = MIN(bytes, data - offset);
 ret = BDRV_BLOCK_ZERO;
 }

+*map = offset;
 *file = bs;

-return ret | BDRV_BLOCK_OFFSET_VALID | start;
+return ret | BDRV_BLOCK_OFFSET_VALID;
 }


@@ -1438,7 +1436,7 @@ static BlockDriver bdrv_gluster = {
 #ifdef CONFIG_GLUSTERFS_ZEROFILL
 

[Qemu-block] [PATCH v6 06/20] iscsi: Switch iscsi_allocmap_update() to byte-based

2017-12-07 Thread Eric Blake
We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based.  Convert all uses of
the allocmap (no semantic change).  Callers that already had bytes
available are simpler, and callers that now scale to bytes will be
easier to switch to byte-based in the future.
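The interesting arithmetic in this conversion is the pair of cluster ranges computed by iscsi_allocmap_update(): an "expanded" range covering every cluster the byte range touches (safe when setting bits) and a "shrunk" range covering only fully contained clusters (safe when clearing bits, so a partially touched cluster is never wrongly marked unallocated). A Python model of that rounding (illustrative only, not the iscsi driver):

```python
def div_round_up(n, d):
    return -(-n // d)   # ceiling division, like QEMU's DIV_ROUND_UP

def affected_clusters(offset, nbytes, cluster_size):
    """Return ((first, count), (first, count)) for expanded/shrunk ranges."""
    first_exp = offset // cluster_size
    n_exp = div_round_up(offset + nbytes, cluster_size) - first_exp
    first_shr = div_round_up(offset, cluster_size)
    n_shr = max((offset + nbytes) // cluster_size - first_shr, 0)
    return (first_exp, n_exp), (first_shr, n_shr)

# Bytes [1000, 11000) with 4 KiB clusters touch clusters 0-2 but fully
# contain only cluster 1:
print(affected_clusters(1000, 10000, 4096))   # ((0, 3), (1, 1))
```

The byte-based version divides by cluster_size directly, where the sector-based one first had to derive cluster_sectors from it.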

Signed-off-by: Eric Blake 
Acked-by: Paolo Bonzini 

---
v3-v5: no change
v2: rebase to count/bytes rename
---
 block/iscsi.c | 90 +--
 1 file changed, 44 insertions(+), 46 deletions(-)

diff --git a/block/iscsi.c b/block/iscsi.c
index 8f903d8370..4896d50d6e 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -455,24 +455,22 @@ static int iscsi_allocmap_init(IscsiLun *iscsilun, int open_flags)
 }

 static void
-iscsi_allocmap_update(IscsiLun *iscsilun, int64_t sector_num,
-  int nb_sectors, bool allocated, bool valid)
+iscsi_allocmap_update(IscsiLun *iscsilun, int64_t offset,
+  int64_t bytes, bool allocated, bool valid)
 {
 int64_t cl_num_expanded, nb_cls_expanded, cl_num_shrunk, nb_cls_shrunk;
-int cluster_sectors = iscsilun->cluster_size >> BDRV_SECTOR_BITS;

 if (iscsilun->allocmap == NULL) {
 return;
 }
 /* expand to entirely contain all affected clusters */
-assert(cluster_sectors);
-cl_num_expanded = sector_num / cluster_sectors;
-nb_cls_expanded = DIV_ROUND_UP(sector_num + nb_sectors,
-   cluster_sectors) - cl_num_expanded;
+assert(iscsilun->cluster_size);
+cl_num_expanded = offset / iscsilun->cluster_size;
+nb_cls_expanded = DIV_ROUND_UP(offset + bytes, iscsilun->cluster_size)
+- cl_num_expanded;
 /* shrink to touch only completely contained clusters */
-cl_num_shrunk = DIV_ROUND_UP(sector_num, cluster_sectors);
-nb_cls_shrunk = (sector_num + nb_sectors) / cluster_sectors
-  - cl_num_shrunk;
+cl_num_shrunk = DIV_ROUND_UP(offset, iscsilun->cluster_size);
+nb_cls_shrunk = (offset + bytes) / iscsilun->cluster_size - cl_num_shrunk;
 if (allocated) {
 bitmap_set(iscsilun->allocmap, cl_num_expanded, nb_cls_expanded);
 } else {
@@ -495,26 +493,26 @@ iscsi_allocmap_update(IscsiLun *iscsilun, int64_t sector_num,
 }

 static void
-iscsi_allocmap_set_allocated(IscsiLun *iscsilun, int64_t sector_num,
- int nb_sectors)
+iscsi_allocmap_set_allocated(IscsiLun *iscsilun, int64_t offset,
+ int64_t bytes)
 {
-iscsi_allocmap_update(iscsilun, sector_num, nb_sectors, true, true);
+iscsi_allocmap_update(iscsilun, offset, bytes, true, true);
 }

 static void
-iscsi_allocmap_set_unallocated(IscsiLun *iscsilun, int64_t sector_num,
-   int nb_sectors)
+iscsi_allocmap_set_unallocated(IscsiLun *iscsilun, int64_t offset,
+   int64_t bytes)
 {
 /* Note: if cache.direct=on the fifth argument to iscsi_allocmap_update
  * is ignored, so this will in effect be an iscsi_allocmap_set_invalid.
  */
-iscsi_allocmap_update(iscsilun, sector_num, nb_sectors, false, true);
+iscsi_allocmap_update(iscsilun, offset, bytes, false, true);
 }

-static void iscsi_allocmap_set_invalid(IscsiLun *iscsilun, int64_t sector_num,
-   int nb_sectors)
+static void iscsi_allocmap_set_invalid(IscsiLun *iscsilun, int64_t offset,
+   int64_t bytes)
 {
-iscsi_allocmap_update(iscsilun, sector_num, nb_sectors, false, false);
+iscsi_allocmap_update(iscsilun, offset, bytes, false, false);
 }

 static void iscsi_allocmap_invalidate(IscsiLun *iscsilun)
@@ -528,34 +526,30 @@ static void iscsi_allocmap_invalidate(IscsiLun *iscsilun)
 }

 static inline bool
-iscsi_allocmap_is_allocated(IscsiLun *iscsilun, int64_t sector_num,
-int nb_sectors)
+iscsi_allocmap_is_allocated(IscsiLun *iscsilun, int64_t offset,
+int64_t bytes)
 {
 unsigned long size;
 if (iscsilun->allocmap == NULL) {
 return true;
 }
 assert(iscsilun->cluster_size);
-size = DIV_ROUND_UP(sector_num + nb_sectors,
-iscsilun->cluster_size >> BDRV_SECTOR_BITS);
+size = DIV_ROUND_UP(offset + bytes, iscsilun->cluster_size);
 return !(find_next_bit(iscsilun->allocmap, size,
-   sector_num * BDRV_SECTOR_SIZE /
-   iscsilun->cluster_size) == size);
+   offset / iscsilun->cluster_size) == size);
 }

 static inline bool iscsi_allocmap_is_valid(IscsiLun *iscsilun,
-   int64_t sector_num, int nb_sectors)
+   int64_t offset, int64_t bytes)
 {
 unsigned long size;
 if (iscsilun->allocmap_valid == NULL) {
 return false;
 }
 

[Qemu-block] [PATCH v6 03/20] file-posix: Switch to .bdrv_co_block_status()

2017-12-07 Thread Eric Blake
We are gradually moving away from sector-based interfaces, towards
byte-based.  Update the file protocol driver accordingly.

In want_zero mode, we continue to report fine-grained hole
information (the caller wants as much mapping detail as possible);
but when not in that mode, the caller prefers larger *pnum and
merely cares about what offsets are allocated at this layer, rather
than where the holes live.  Since holes still read as zeroes at
this layer (rather than deferring to a backing layer), we can take
the shortcut of skipping lseek(), and merely state that all bytes
are allocated.

We can also drop redundant bounds checks that are already
guaranteed by the block layer.
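
The lseek() probing that find_allocation() builds on already works in
bytes, which is one reason the byte-based interface is a natural fit
here.  A minimal sketch of that mechanism (illustration only, not QEMU
code; assumes Linux SEEK_DATA/SEEK_HOLE support on the filesystem):

```python
import os
import tempfile

def first_extents(path):
    """Return (start of first data extent, end of that extent) in bytes."""
    fd = os.open(path, os.O_RDONLY)
    try:
        data = os.lseek(fd, 0, os.SEEK_DATA)     # start of first data extent
        hole = os.lseek(fd, data, os.SEEK_HOLE)  # end of that extent
        return data, hole
    finally:
        os.close(fd)

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 4096)    # 4 KiB of data at offset 0, no explicit holes
    path = f.name

# Data extent starts at 0; the implicit hole at EOF begins at 4096.
data, hole = first_extents(path)
os.unlink(path)
```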

Signed-off-by: Eric Blake 

---
v5: drop redundant code
v4: tweak commit message [Fam], rebase to interface tweak
v3: no change
v2: tweak comment, add mapping support
---
 block/file-posix.c | 62 +-
 1 file changed, 29 insertions(+), 33 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 36ee89e940..e847c7cdd9 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2128,25 +2128,24 @@ static int find_allocation(BlockDriverState *bs, off_t start,
 }

 /*
- * Returns the allocation status of the specified sectors.
+ * Returns the allocation status of the specified offset.
  *
- * If 'sector_num' is beyond the end of the disk image the return value is 0
- * and 'pnum' is set to 0.
+ * The block layer guarantees 'offset' and 'bytes' are within bounds.
  *
- * 'pnum' is set to the number of sectors (including and immediately following
- * the specified sector) that are known to be in the same
+ * 'pnum' is set to the number of bytes (including and immediately following
+ * the specified offset) that are known to be in the same
  * allocated/unallocated state.
  *
- * 'nb_sectors' is the max value 'pnum' should be set to.  If nb_sectors goes
- * beyond the end of the disk image it will be clamped.
+ * 'bytes' is the max value 'pnum' should be set to.
  */
-static int64_t coroutine_fn raw_co_get_block_status(BlockDriverState *bs,
-int64_t sector_num,
-int nb_sectors, int *pnum,
-BlockDriverState **file)
+static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
+bool want_zero,
+int64_t offset,
+int64_t bytes, int64_t *pnum,
+int64_t *map,
+BlockDriverState **file)
 {
-off_t start, data = 0, hole = 0;
-int64_t total_size;
+off_t data = 0, hole = 0;
 int ret;

 ret = fd_open(bs);
@@ -2154,39 +2153,36 @@ static int64_t coroutine_fn raw_co_get_block_status(BlockDriverState *bs,
 return ret;
 }

-start = sector_num * BDRV_SECTOR_SIZE;
-total_size = bdrv_getlength(bs);
-if (total_size < 0) {
-return total_size;
-} else if (start >= total_size) {
-*pnum = 0;
-return 0;
-} else if (start + nb_sectors * BDRV_SECTOR_SIZE > total_size) {
-nb_sectors = DIV_ROUND_UP(total_size - start, BDRV_SECTOR_SIZE);
+if (!want_zero) {
+*pnum = bytes;
+*map = offset;
+*file = bs;
+return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
 }

-ret = find_allocation(bs, start, &data, &hole);
+ret = find_allocation(bs, offset, &data, &hole);
 if (ret == -ENXIO) {
 /* Trailing hole */
-*pnum = nb_sectors;
+*pnum = bytes;
 ret = BDRV_BLOCK_ZERO;
 } else if (ret < 0) {
 /* No info available, so pretend there are no holes */
-*pnum = nb_sectors;
+*pnum = bytes;
 ret = BDRV_BLOCK_DATA;
-} else if (data == start) {
-/* On a data extent, compute sectors to the end of the extent,
+} else if (data == offset) {
+/* On a data extent, compute bytes to the end of the extent,
  * possibly including a partial sector at EOF. */
-*pnum = MIN(nb_sectors, DIV_ROUND_UP(hole - start, BDRV_SECTOR_SIZE));
+*pnum = MIN(bytes, hole - offset);
 ret = BDRV_BLOCK_DATA;
 } else {
-/* On a hole, compute sectors to the beginning of the next extent.  */
-assert(hole == start);
-*pnum = MIN(nb_sectors, (data - start) / BDRV_SECTOR_SIZE);
+/* On a hole, compute bytes to the beginning of the next extent.  */
+assert(hole == offset);
+*pnum = MIN(bytes, data - offset);
 ret = BDRV_BLOCK_ZERO;
 }
+*map = offset;
 *file = bs;
-return ret | BDRV_BLOCK_OFFSET_VALID | start;
+return ret | BDRV_BLOCK_OFFSET_VALID;
 }

 static coroutine_fn BlockAIOCB *raw_aio_pdiscard(BlockDriverState *bs,
@@ -2280,7 +2276,7 @@ BlockDriver 

[Qemu-block] [PATCH v6 05/20] iscsi: Switch cluster_sectors to byte-based

2017-12-07 Thread Eric Blake
We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based.  Convert all uses of
the cluster size in sectors, along with adding assertions that we
are not dividing by zero.

Improve some comment grammar while in the area.
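
The sector/byte bookkeeping this patch changes can be checked with quick
arithmetic (illustration only, not QEMU code): as long as the cluster
size is a multiple of 512 bytes, the byte-based index expression
introduced here agrees with the old sector-based one:

```python
BDRV_SECTOR_SIZE = 512
BDRV_SECTOR_BITS = 9

def cluster_index_sector_based(sector_num, cluster_sectors):
    # the pre-patch form: sector_num / iscsilun->cluster_sectors
    return sector_num // cluster_sectors

def cluster_index_byte_based(sector_num, cluster_size):
    # the post-patch form:
    # sector_num * BDRV_SECTOR_SIZE / iscsilun->cluster_size
    return sector_num * BDRV_SECTOR_SIZE // cluster_size

cluster_size = 64 * 1024                        # e.g. 64k clusters
cluster_sectors = cluster_size >> BDRV_SECTOR_BITS

for sector_num in (0, 1, 127, 128, 129, 100000):
    assert (cluster_index_sector_based(sector_num, cluster_sectors)
            == cluster_index_byte_based(sector_num, cluster_size))
print("ok")
```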

Signed-off-by: Eric Blake 
Acked-by: Paolo Bonzini 

---
v2-v5: no change
---
 block/iscsi.c | 56 +++-
 1 file changed, 35 insertions(+), 21 deletions(-)

diff --git a/block/iscsi.c b/block/iscsi.c
index 4683f3b244..8f903d8370 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -84,7 +84,7 @@ typedef struct IscsiLun {
 unsigned long *allocmap;
 unsigned long *allocmap_valid;
 long allocmap_size;
-int cluster_sectors;
+int cluster_size;
 bool use_16_for_rw;
 bool write_protected;
 bool lbpme;
@@ -427,9 +427,10 @@ static int iscsi_allocmap_init(IscsiLun *iscsilun, int open_flags)
 {
 iscsi_allocmap_free(iscsilun);

+assert(iscsilun->cluster_size);
 iscsilun->allocmap_size =
-DIV_ROUND_UP(sector_lun2qemu(iscsilun->num_blocks, iscsilun),
- iscsilun->cluster_sectors);
+DIV_ROUND_UP(iscsilun->num_blocks * iscsilun->block_size,
+ iscsilun->cluster_size);

 iscsilun->allocmap = bitmap_try_new(iscsilun->allocmap_size);
 if (!iscsilun->allocmap) {
@@ -437,7 +438,7 @@ static int iscsi_allocmap_init(IscsiLun *iscsilun, int open_flags)
 }

 if (open_flags & BDRV_O_NOCACHE) {
-/* in case that cache.direct = on all allocmap entries are
+/* when cache.direct = on all allocmap entries are
  * treated as invalid to force a relookup of the block
  * status on every read request */
 return 0;
@@ -458,17 +459,19 @@ iscsi_allocmap_update(IscsiLun *iscsilun, int64_t sector_num,
   int nb_sectors, bool allocated, bool valid)
 {
 int64_t cl_num_expanded, nb_cls_expanded, cl_num_shrunk, nb_cls_shrunk;
+int cluster_sectors = iscsilun->cluster_size >> BDRV_SECTOR_BITS;

 if (iscsilun->allocmap == NULL) {
 return;
 }
 /* expand to entirely contain all affected clusters */
-cl_num_expanded = sector_num / iscsilun->cluster_sectors;
+assert(cluster_sectors);
+cl_num_expanded = sector_num / cluster_sectors;
 nb_cls_expanded = DIV_ROUND_UP(sector_num + nb_sectors,
-   iscsilun->cluster_sectors) - cl_num_expanded;
+   cluster_sectors) - cl_num_expanded;
 /* shrink to touch only completely contained clusters */
-cl_num_shrunk = DIV_ROUND_UP(sector_num, iscsilun->cluster_sectors);
-nb_cls_shrunk = (sector_num + nb_sectors) / iscsilun->cluster_sectors
+cl_num_shrunk = DIV_ROUND_UP(sector_num, cluster_sectors);
+nb_cls_shrunk = (sector_num + nb_sectors) / cluster_sectors
   - cl_num_shrunk;
 if (allocated) {
 bitmap_set(iscsilun->allocmap, cl_num_expanded, nb_cls_expanded);
@@ -532,9 +535,12 @@ iscsi_allocmap_is_allocated(IscsiLun *iscsilun, int64_t sector_num,
 if (iscsilun->allocmap == NULL) {
 return true;
 }
-size = DIV_ROUND_UP(sector_num + nb_sectors, iscsilun->cluster_sectors);
+assert(iscsilun->cluster_size);
+size = DIV_ROUND_UP(sector_num + nb_sectors,
+iscsilun->cluster_size >> BDRV_SECTOR_BITS);
 return !(find_next_bit(iscsilun->allocmap, size,
-   sector_num / iscsilun->cluster_sectors) == size);
+   sector_num * BDRV_SECTOR_SIZE /
+   iscsilun->cluster_size) == size);
 }

 static inline bool iscsi_allocmap_is_valid(IscsiLun *iscsilun,
@@ -544,9 +550,12 @@ static inline bool iscsi_allocmap_is_valid(IscsiLun *iscsilun,
 if (iscsilun->allocmap_valid == NULL) {
 return false;
 }
-size = DIV_ROUND_UP(sector_num + nb_sectors, iscsilun->cluster_sectors);
+assert(iscsilun->cluster_size);
+size = DIV_ROUND_UP(sector_num + nb_sectors,
+iscsilun->cluster_size >> BDRV_SECTOR_BITS);
 return (find_next_zero_bit(iscsilun->allocmap_valid, size,
-   sector_num / iscsilun->cluster_sectors) == size);
+   sector_num * BDRV_SECTOR_SIZE /
+   iscsilun->cluster_size) == size);
 }

 static int coroutine_fn
@@ -781,16 +790,21 @@ static int coroutine_fn iscsi_co_readv(BlockDriverState *bs,
 BlockDriverState *file;
 /* check the block status from the beginning of the cluster
  * containing the start sector */
-int64_t ret = iscsi_co_get_block_status(bs,
-  sector_num - sector_num % iscsilun->cluster_sectors,
-  BDRV_REQUEST_MAX_SECTORS, &pnum, &file);
+int cluster_sectors = iscsilun->cluster_size >> BDRV_SECTOR_BITS;
+

[Qemu-block] [PATCH v6 01/20] block: Add .bdrv_co_block_status() callback

2017-12-07 Thread Eric Blake
We are gradually moving away from sector-based interfaces, towards
byte-based. Now that the block layer exposes byte-based allocation,
it's time to tackle the drivers.  Add a new callback that operates
on as small as byte boundaries. Subsequent patches will then update
individual drivers, then finally remove .bdrv_co_get_block_status().

The new code also passes through the 'want_zero' hint, which will
allow subsequent patches to further optimize callers that only care
about how much of the image is allocated (want_zero is false),
rather than full details about runs of zeroes and which offsets the
allocation actually maps to (want_zero is true).  As part of this
effort, fix another part of the documentation: the claim in commit
4c41cb4 that BDRV_BLOCK_ALLOCATED is short for 'DATA || ZERO' is a
lie at the block layer (see commit e88ae2264), even though it is
how the bit is computed from the driver layer.  After all, there
are intentionally cases where we return ZERO but not ALLOCATED at
the block layer, when we know that a read sees zero because the
backing file is too short.  Note that the driver interface is thus
slightly different than the public interface with regards to which
bits will be set, and what guarantees are provided on input.

We also add an assertion that any driver using the new callback will
make progress (the only time pnum will be 0 is if the block layer
already handled an out-of-bounds request, or if there is an error);
the old driver interface did not provide this guarantee, which
could lead to some inf-loops in drastic corner-case failures.
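
The driver/block-layer split described above can be modeled in a few
lines (illustration only; the flag values are believed to mirror
block.h at the time of this series, but treat the model itself as an
assumption, not QEMU code):

```python
# Status bits, values shown for illustration:
BDRV_BLOCK_DATA         = 0x01
BDRV_BLOCK_ZERO         = 0x02
BDRV_BLOCK_OFFSET_VALID = 0x04
BDRV_BLOCK_ALLOCATED    = 0x10

def block_layer_status(driver_ret, beyond_backing_eof=False):
    """Sketch of how the block layer post-processes a driver's answer."""
    ret = driver_ret
    if ret & (BDRV_BLOCK_DATA | BDRV_BLOCK_ZERO):
        ret |= BDRV_BLOCK_ALLOCATED   # content determined by this layer
    elif beyond_backing_eof:
        ret |= BDRV_BLOCK_ZERO        # reads as zero, but NOT allocated here
    return ret

# Driver reported mapped data: the block layer adds ALLOCATED.
assert block_layer_status(BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID) \
       & BDRV_BLOCK_ALLOCATED
# Zero only because the backing file is short: ZERO without ALLOCATED.
s = block_layer_status(0, beyond_backing_eof=True)
assert s & BDRV_BLOCK_ZERO and not (s & BDRV_BLOCK_ALLOCATED)
print("ok")
```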

Signed-off-by: Eric Blake 

---
v6: drop now-useless rounding of mid-sector end-of-file hole [Kevin],
better documentation of 'want_zero' [Kevin]
v5: rebase to master, typo fix, document more block layer guarantees
v4: rebase to master
v3: no change
v2: improve alignment handling, ensure all iotests still pass
---
 include/block/block.h     | 14 +++---
 include/block/block_int.h | 20 +++-
 block/io.c                | 28 +++-
 3 files changed, 41 insertions(+), 21 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index c05cac57e5..afc9ece4d6 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -128,19 +128,19 @@ typedef struct HDGeometry {
  * BDRV_BLOCK_ZERO: offset reads as zero
  * BDRV_BLOCK_OFFSET_VALID: an associated offset exists for accessing raw data
  * BDRV_BLOCK_ALLOCATED: the content of the block is determined by this
- *   layer (short for DATA || ZERO), set by block layer
- * BDRV_BLOCK_EOF: the returned pnum covers through end of file for this layer
+ *   layer rather than any backing, set by block layer
+ * BDRV_BLOCK_EOF: the returned pnum covers through end of file for this
+ * layer, set by block layer
  *
  * Internal flag:
  * BDRV_BLOCK_RAW: for use by passthrough drivers, such as raw, to request
  * that the block layer recompute the answer from the returned
  * BDS; must be accompanied by just BDRV_BLOCK_OFFSET_VALID.
  *
- * If BDRV_BLOCK_OFFSET_VALID is set, bits 9-62 (BDRV_BLOCK_OFFSET_MASK) of
- * the return value (old interface) or the entire map parameter (new
- * interface) represent the offset in the returned BDS that is allocated for
- * the corresponding raw data.  However, whether that offset actually
- * contains data also depends on BDRV_BLOCK_DATA, as follows:
+ * If BDRV_BLOCK_OFFSET_VALID is set, the map parameter represents the
+ * host offset within the returned BDS that is allocated for the
+ * corresponding raw guest data.  However, whether that offset
+ * actually contains data also depends on BDRV_BLOCK_DATA, as follows:
  *
  * DATA ZERO OFFSET_VALID
 *   t    t        t       sectors read as zero, returned file is zero at offset
diff --git a/include/block/block_int.h b/include/block/block_int.h
index a5482775ec..614197f208 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -204,15 +204,25 @@ struct BlockDriver {
 /*
  * Building block for bdrv_block_status[_above] and
  * bdrv_is_allocated[_above].  The driver should answer only
- * according to the current layer, and should not set
- * BDRV_BLOCK_ALLOCATED, but may set BDRV_BLOCK_RAW.  See block.h
- * for the meaning of _DATA, _ZERO, and _OFFSET_VALID.  The block
- * layer guarantees input aligned to request_alignment, as well as
- * non-NULL pnum and file.
+ * according to the current layer, and should only need to set
+ * BDRV_BLOCK_DATA, BDRV_BLOCK_ZERO, BDRV_BLOCK_OFFSET_VALID,
+ * and/or BDRV_BLOCK_RAW; if the current layer defers to a backing
+ * layer, the result should be 0 (and not BDRV_BLOCK_ZERO).  See
+ * block.h for the overall meaning of the bits.  As a hint, the
+ * flag want_zero is true if the caller cares more about precise
+ * mappings (favor accurate 

[Qemu-block] [PATCH v6 02/20] block: Switch passthrough drivers to .bdrv_co_block_status()

2017-12-07 Thread Eric Blake
We are gradually moving away from sector-based interfaces, towards
byte-based.  Update the generic helpers, and all passthrough clients
(blkdebug, commit, mirror, throttle) accordingly.

Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 

---
v5: rebase to master
v4: rebase to interface tweak
v3: rebase to addition of throttle driver
v2: rebase to master, retitle while merging blkdebug, commit, and mirror
---
 include/block/block_int.h | 28 
 block/io.c                | 36 
 block/blkdebug.c          | 20 +++-
 block/commit.c            |  2 +-
 block/mirror.c            |  2 +-
 block/throttle.c          |  2 +-
 6 files changed, 50 insertions(+), 40 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 614197f208..42e5c37a63 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -1029,23 +1029,27 @@ void bdrv_format_default_perms(BlockDriverState *bs, BdrvChild *c,
uint64_t *nperm, uint64_t *nshared);

 /*
- * Default implementation for drivers to pass bdrv_co_get_block_status() to
+ * Default implementation for drivers to pass bdrv_co_block_status() to
  * their file.
  */
-int64_t coroutine_fn bdrv_co_get_block_status_from_file(BlockDriverState *bs,
-int64_t sector_num,
-int nb_sectors,
-int *pnum,
-BlockDriverState **file);
+int coroutine_fn bdrv_co_block_status_from_file(BlockDriverState *bs,
+bool want_zero,
+int64_t offset,
+int64_t bytes,
+int64_t *pnum,
+int64_t *map,
+BlockDriverState **file);
 /*
- * Default implementation for drivers to pass bdrv_co_get_block_status() to
+ * Default implementation for drivers to pass bdrv_co_block_status() to
  * their backing file.
  */
-int64_t coroutine_fn bdrv_co_get_block_status_from_backing(BlockDriverState *bs,
-   int64_t sector_num,
-   int nb_sectors,
-   int *pnum,
-   BlockDriverState **file);
+int coroutine_fn bdrv_co_block_status_from_backing(BlockDriverState *bs,
+   bool want_zero,
+   int64_t offset,
+   int64_t bytes,
+   int64_t *pnum,
+   int64_t *map,
+   BlockDriverState **file);
 const char *bdrv_get_parent_name(const BlockDriverState *bs);
 void blk_dev_change_media_cb(BlockBackend *blk, bool load, Error **errp);
 bool blk_dev_has_removable_media(BlockBackend *blk);
diff --git a/block/io.c b/block/io.c
index cbb6e6d073..109b3826cd 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1798,30 +1798,34 @@ typedef struct BdrvCoBlockStatusData {
 bool done;
 } BdrvCoBlockStatusData;

-int64_t coroutine_fn bdrv_co_get_block_status_from_file(BlockDriverState *bs,
-int64_t sector_num,
-int nb_sectors,
-int *pnum,
-BlockDriverState **file)
+int coroutine_fn bdrv_co_block_status_from_file(BlockDriverState *bs,
+bool want_zero,
+int64_t offset,
+int64_t bytes,
+int64_t *pnum,
+int64_t *map,
+BlockDriverState **file)
 {
 assert(bs->file && bs->file->bs);
-*pnum = nb_sectors;
+*pnum = bytes;
+*map = offset;
 *file = bs->file->bs;
-return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID |
-   (sector_num << BDRV_SECTOR_BITS);
+return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID;
 }

-int64_t coroutine_fn bdrv_co_get_block_status_from_backing(BlockDriverState *bs,
-   int64_t sector_num,
- 

[Qemu-block] [PATCH v6 00/20] add byte-based block_status driver callbacks

2017-12-07 Thread Eric Blake
There are patches floating around to add NBD_CMD_BLOCK_STATUS,
but NBD wants to report status on byte granularity (even if the
reporting will probably be naturally aligned to sectors or even
much higher levels).  I've therefore started the task of
converting our block status code to report at a byte granularity
rather than sectors.

These patches have been around for a while, but it's time to
finish it now that 2.12 is nearly open for patches.

The overall conversion currently looks like:
part 1: bdrv_is_allocated (merged, commit 51b0a488, 2.10)
part 2: dirty-bitmap (merged, commit ca759622, 2.11)
part 3: bdrv_get_block_status (merged, commit f0a9c18f, 2.11)
part 4: .bdrv_co_block_status (this series, v5 was here [1])

[1] https://lists.gnu.org/archive/html/qemu-devel/2017-12/msg00034.html

Available as a tag at:
git fetch git://repo.or.cz/qemu/ericb.git nbd-byte-callback-v6

Since v5:
- more documentation improvements [Kevin]
- drop end-of-file rounding, no longer needed since v4 when the
API was fixed to return the full mapped offset via parameter
rather than a rounded offset via return value [Kevin]

001/20:[0064] [FC] 'block: Add .bdrv_co_block_status() callback'
002/20:[] [--] 'block: Switch passthrough drivers to .bdrv_co_block_status()'
003/20:[] [--] 'file-posix: Switch to .bdrv_co_block_status()'
004/20:[] [--] 'gluster: Switch to .bdrv_co_block_status()'
005/20:[] [--] 'iscsi: Switch cluster_sectors to byte-based'
006/20:[] [--] 'iscsi: Switch iscsi_allocmap_update() to byte-based'
007/20:[] [--] 'iscsi: Switch to .bdrv_co_block_status()'
008/20:[] [--] 'null: Switch to .bdrv_co_block_status()'
009/20:[] [--] 'parallels: Switch to .bdrv_co_block_status()'
010/20:[] [--] 'qcow: Switch to .bdrv_co_block_status()'
011/20:[] [--] 'qcow2: Switch to .bdrv_co_block_status()'
012/20:[] [--] 'qed: Switch to .bdrv_co_block_status()'
013/20:[] [--] 'raw: Switch to .bdrv_co_block_status()'
014/20:[] [--] 'sheepdog: Switch to .bdrv_co_block_status()'
015/20:[] [--] 'vdi: Avoid bitrot of debugging code'
016/20:[] [--] 'vdi: Switch to .bdrv_co_block_status()'
017/20:[] [--] 'vmdk: Switch to .bdrv_co_block_status()'
018/20:[] [--] 'vpc: Switch to .bdrv_co_block_status()'
019/20:[] [--] 'vvfat: Switch to .bdrv_co_block_status()'
020/20:[0024] [FC] 'block: Drop unused .bdrv_co_get_block_status()'

Eric Blake (20):
  block: Add .bdrv_co_block_status() callback
  block: Switch passthrough drivers to .bdrv_co_block_status()
  file-posix: Switch to .bdrv_co_block_status()
  gluster: Switch to .bdrv_co_block_status()
  iscsi: Switch cluster_sectors to byte-based
  iscsi: Switch iscsi_allocmap_update() to byte-based
  iscsi: Switch to .bdrv_co_block_status()
  null: Switch to .bdrv_co_block_status()
  parallels: Switch to .bdrv_co_block_status()
  qcow: Switch to .bdrv_co_block_status()
  qcow2: Switch to .bdrv_co_block_status()
  qed: Switch to .bdrv_co_block_status()
  raw: Switch to .bdrv_co_block_status()
  sheepdog: Switch to .bdrv_co_block_status()
  vdi: Avoid bitrot of debugging code
  vdi: Switch to .bdrv_co_block_status()
  vmdk: Switch to .bdrv_co_block_status()
  vpc: Switch to .bdrv_co_block_status()
  vvfat: Switch to .bdrv_co_block_status()
  block: Drop unused .bdrv_co_get_block_status()

 include/block/block.h     |  14 ++---
 include/block/block_int.h |  51 +--
 block/io.c                |  86 +++--
 block/blkdebug.c          |  20 +++---
 block/commit.c            |   2 +-
 block/file-posix.c        |  62 +-
 block/gluster.c           |  70 ++---
 block/iscsi.c             | 157 --
 block/mirror.c            |   2 +-
 block/null.c              |  23 +++
 block/parallels.c         |  22 ---
 block/qcow.c              |  27 
 block/qcow2.c             |  24 +++
 block/qed.c               |  84 +
 block/raw-format.c        |  16 ++---
 block/sheepdog.c          |  26 
 block/throttle.c          |   2 +-
 block/vdi.c               |  45 +++--
 block/vmdk.c              |  38 +--
 block/vpc.c               |  43 ++---
 block/vvfat.c             |  16 +++--
 21 files changed, 403 insertions(+), 427 deletions(-)

-- 
2.14.3




[Qemu-block] [PATCH v2 6/6] qemu-iotests: add 203 savevm with IOThreads test

2017-12-07 Thread Stefan Hajnoczi
This test case will prevent future regressions with savevm and
IOThreads.

Signed-off-by: Stefan Hajnoczi 
---
 tests/qemu-iotests/203 | 59 ++
 tests/qemu-iotests/203.out |  6 +
 tests/qemu-iotests/group   |  1 +
 3 files changed, 66 insertions(+)
 create mode 100755 tests/qemu-iotests/203
 create mode 100644 tests/qemu-iotests/203.out

diff --git a/tests/qemu-iotests/203 b/tests/qemu-iotests/203
new file mode 100755
index 00..2c811917d8
--- /dev/null
+++ b/tests/qemu-iotests/203
@@ -0,0 +1,59 @@
+#!/usr/bin/env python
+#
+# Copyright (C) 2017 Red Hat, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+# Creator/Owner: Stefan Hajnoczi 
+#
+# Check that QMP 'migrate' with multiple drives on a single IOThread completes
+# successfully.  This particular command triggered a hang in the source QEMU
+# process due to recursive AioContext locking in bdrv_invalidate_all() and
+# BDRV_POLL_WHILE().
+
+import iotests
+
+iotests.verify_image_format(supported_fmts=['qcow2'])
+iotests.verify_platform(['linux'])
+
+with iotests.FilePath('disk0.img') as disk0_img_path, \
+ iotests.FilePath('disk1.img') as disk1_img_path, \
+ iotests.VM() as vm:
+
+img_size = '10M'
+iotests.qemu_img_pipe('create', '-f', iotests.imgfmt, disk0_img_path, img_size)
+iotests.qemu_img_pipe('create', '-f', iotests.imgfmt, disk1_img_path, img_size)
+
+iotests.log('Launching VM...')
+(vm.add_object('iothread,id=iothread0')
+   .add_drive(disk0_img_path, 'node-name=drive0-node', interface='none')
+   .add_drive(disk1_img_path, 'node-name=drive1-node', interface='none')
+   .launch())
+
+iotests.log('Setting IOThreads...')
+iotests.log(vm.qmp('x-blockdev-set-iothread',
+   node_name='drive0-node', iothread='iothread0',
+   force=True))
+iotests.log(vm.qmp('x-blockdev-set-iothread',
+   node_name='drive1-node', iothread='iothread0',
+   force=True))
+
+iotests.log('Starting migration...')
+iotests.log(vm.qmp('migrate', uri='exec:cat >/dev/null'))
+while True:
+vm.get_qmp_event(wait=60.0)
+result = vm.qmp('query-migrate')
+status = result.get('return', {}).get('status', None)
+if status == 'completed':
+break
diff --git a/tests/qemu-iotests/203.out b/tests/qemu-iotests/203.out
new file mode 100644
index 00..3f1ff900e4
--- /dev/null
+++ b/tests/qemu-iotests/203.out
@@ -0,0 +1,6 @@
+Launching VM...
+Setting IOThreads...
+{u'return': {}}
+{u'return': {}}
+Starting migration...
+{u'return': {}}
diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index d0ee1e2e55..93d96fb22f 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -198,3 +198,4 @@
 198 rw auto
 200 rw auto
 202 rw auto quick
+203 rw auto
-- 
2.14.3




[Qemu-block] [PATCH v2 3/6] blockdev: add x-blockdev-set-iothread force boolean

2017-12-07 Thread Stefan Hajnoczi
When a node is already associated with a BlockBackend the
x-blockdev-set-iothread command refuses to set the IOThread.  This is to
prevent accidentally changing the IOThread when the nodes are in use.

When the nodes are created with -drive they automatically get a
BlockBackend.  In that case we know nothing is using them yet and it's
safe to set the IOThread.  Add a force boolean to override the check.
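
For reference, the new flag would be used over QMP roughly like this (a
sketch; the node and iothread names match those used by the test case
added later in this series):

```json
{ "execute": "x-blockdev-set-iothread",
  "arguments": { "node-name": "drive0-node",
                 "iothread": "iothread0",
                 "force": true } }
```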

Signed-off-by: Stefan Hajnoczi 
---
 qapi/block-core.json |  6 +-
 blockdev.c           | 11 ++-
 2 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 741d6c4367..a8cdbc300b 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -3962,6 +3962,9 @@
 #
 # @iothread: the name of the IOThread object or null for the main loop
 #
+# @force: true if the node and its children should be moved when a BlockBackend
+# is already attached
+#
 # Note: this command is experimental and intended for test cases that need
 # control over IOThreads only.
 #
@@ -3984,4 +3987,5 @@
 ##
 { 'command': 'x-blockdev-set-iothread',
   'data' : { 'node-name': 'str',
- 'iothread': 'StrOrNull' } }
+ 'iothread': 'StrOrNull',
+ '*force': 'bool' } }
diff --git a/blockdev.c b/blockdev.c
index f75c01f664..9c3a430cfb 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -4131,7 +4131,7 @@ BlockJobInfoList *qmp_query_block_jobs(Error **errp)
 }
 
 void qmp_x_blockdev_set_iothread(const char *node_name, StrOrNull *iothread,
- Error **errp)
+ bool has_force, bool force, Error **errp)
 {
 AioContext *old_context;
 AioContext *new_context;
@@ -4143,10 +4143,11 @@ void qmp_x_blockdev_set_iothread(const char *node_name, StrOrNull *iothread,
 return;
 }
 
-/* If we want to allow more extreme test scenarios this guard could be
- * removed.  For now it protects against accidents. */
-if (bdrv_has_blk(bs)) {
-error_setg(errp, "Node %s is in use", node_name);
+/* Protects against accidents. */
+if (!(has_force && force) && bdrv_has_blk(bs)) {
+error_setg(errp, "Node %s is associated with a BlockBackend and could "
+ "be in use (use force=true to override this check)",
+ node_name);
 return;
 }
 
-- 
2.14.3




[Qemu-block] [PATCH v2 5/6] iothread: fix iothread_stop() race condition

2017-12-07 Thread Stefan Hajnoczi
There is a small chance that iothread_stop() hangs as follows:

  Thread 3 (Thread 0x7f63eba5f700 (LWP 16105)):
  #0  0x7f64012c09b6 in ppoll () at /lib64/libc.so.6
  #1  0x55959992eac9 in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
  #2  0x55959992eac9 in qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=<optimized out>) at util/qemu-timer.c:322
  #3  0x559599930711 in aio_poll (ctx=0x55959bdb83c0, blocking=blocking@entry=true) at util/aio-posix.c:629
  #4  0x5595996806fe in iothread_run (opaque=0x55959bd78400) at iothread.c:59
  #5  0x7f640159f609 in start_thread () at /lib64/libpthread.so.0
  #6  0x7f64012cce6f in clone () at /lib64/libc.so.6

  Thread 1 (Thread 0x7f640b45b280 (LWP 16103)):
  #0  0x7f64015a0b6d in pthread_join () at /lib64/libpthread.so.0
  #1  0x5595999332ef in qemu_thread_join (thread=<optimized out>) at util/qemu-thread-posix.c:547
  #2  0x5595996808ae in iothread_stop (iothread=<optimized out>) at iothread.c:91
  #3  0x55959968094d in iothread_stop_iter (object=<optimized out>, opaque=<optimized out>) at iothread.c:102
  #4  0x559599857d97 in do_object_child_foreach (obj=obj@entry=0x55959bdb8100, fn=fn@entry=0x559599680930 <iothread_stop_iter>, opaque=opaque@entry=0x0, recurse=recurse@entry=false) at qom/object.c:852
  #5  0x559599859477 in object_child_foreach (obj=obj@entry=0x55959bdb8100, fn=fn@entry=0x559599680930 <iothread_stop_iter>, opaque=opaque@entry=0x0) at qom/object.c:867
  #6  0x559599680a6e in iothread_stop_all () at iothread.c:341
  #7  0x55959955b1d5 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4913

The relevant code from iothread_run() is:

  while (!atomic_read(&iothread->stopping)) {
  aio_poll(iothread->ctx, true);

and iothread_stop():

  iothread->stopping = true;
  aio_notify(iothread->ctx);
  ...
  qemu_thread_join(&iothread->thread);

The following scenario can occur:

1. IOThread:
  while (!atomic_read(&iothread->stopping)) -> stopping=false

2. Main loop:
  iothread->stopping = true;
  aio_notify(iothread->ctx);

3. IOThread:
  aio_poll(iothread->ctx, true); -> hang

The bug is explained by the AioContext->notify_me doc comments:

  "If this field is 0, everything (file descriptors, bottom halves,
  timers) will be re-evaluated before the next blocking poll(), thus the
  event_notifier_set call can be skipped."

The problem is that "everything" does not include checking
iothread->stopping.  This means iothread_run() will block in aio_poll()
if aio_notify() was called just before aio_poll().

This patch fixes the hang by replacing aio_notify() with
aio_bh_schedule_oneshot().  This makes aio_poll() or g_main_loop_run()
return.

Implementing this properly required a new bool running flag.  The new
flag prevents races that are tricky if we try to use iothread->stopping.
Now iothread->stopping is purely for iothread_stop() and
iothread->running is purely for the iothread_run() thread.
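
The shape of the fix can be sketched outside QEMU (illustration only;
MiniLoop and its names are invented for this sketch): the stop request
is delivered as an event the loop must process, so the flag flip happens
in the loop thread itself and cannot be lost between the check and the
blocking poll:

```python
import queue
import threading

class MiniLoop:
    def __init__(self):
        self.bhs = queue.Queue()     # pending "bottom halves"
        self.running = True

    def run(self):
        # Blocking wait for the next event; any scheduled callback wakes us.
        while self.running:
            bh = self.bhs.get()
            bh()

    def schedule_oneshot(self, fn):  # analogous to aio_bh_schedule_oneshot()
        self.bhs.put(fn)

loop = MiniLoop()
t = threading.Thread(target=loop.run)
t.start()

def stop_bh():
    loop.running = False             # runs in the loop thread

loop.schedule_oneshot(stop_bh)       # analogous to iothread_stop()
t.join(timeout=5)
print("stopped:", not t.is_alive())
```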

Signed-off-by: Stefan Hajnoczi 
---
I'm including this patch in this series because it is needed to make the
test case reliable.  It's unrelated to the main goal of the patch
series.
---
 include/sysemu/iothread.h |  3 ++-
 iothread.c                | 20 +++-
 2 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/include/sysemu/iothread.h b/include/sysemu/iothread.h
index 55de1715c7..799614ffd2 100644
--- a/include/sysemu/iothread.h
+++ b/include/sysemu/iothread.h
@@ -29,7 +29,8 @@ typedef struct {
 GOnce once;
 QemuMutex init_done_lock;
 QemuCond init_done_cond;/* is thread initialization done? */
-bool stopping;
+bool stopping;  /* has iothread_stop() been called? */
+bool running;   /* should iothread_run() continue? */
 int thread_id;
 
 /* AioContext poll parameters */
diff --git a/iothread.c b/iothread.c
index e7b93e02a3..d8b6c1fb27 100644
--- a/iothread.c
+++ b/iothread.c
@@ -55,7 +55,7 @@ static void *iothread_run(void *opaque)
 qemu_cond_signal(&iothread->init_done_cond);
 qemu_mutex_unlock(&iothread->init_done_lock);
 
-while (!atomic_read(&iothread->stopping)) {
+while (iothread->running) {
 aio_poll(iothread->ctx, true);
 
 if (atomic_read(&iothread->worker_context)) {
@@ -78,16 +78,25 @@ static void *iothread_run(void *opaque)
 return NULL;
 }
 
+/* Runs in iothread_run() thread */
+static void iothread_stop_bh(void *opaque)
+{
+IOThread *iothread = opaque;
+
+iothread->running = false; /* stop iothread_run() */
+
+if (iothread->main_loop) {
+g_main_loop_quit(iothread->main_loop);
+}
+}
+
 void iothread_stop(IOThread *iothread)
 {
 if (!iothread->ctx || iothread->stopping) {
 return;
 }
 iothread->stopping = true;
-aio_notify(iothread->ctx);
-if (atomic_read(&iothread->main_loop)) {
-g_main_loop_quit(iothread->main_loop);
-}
+aio_bh_schedule_oneshot(iothread->ctx, iothread_stop_bh, iothread);
 qemu_thread_join(&iothread->thread);
 }
 
@@ -134,6 +143,7 @@ static void iothread_complete(UserCreatable *obj, Error **errp)
 char *name, *thread_name;
 
 

[Qemu-block] [PATCH v2 4/6] iotests: add VM.add_object()

2017-12-07 Thread Stefan Hajnoczi
The VM.add_object() method can be used to add IOThreads or memory
backend objects.

Signed-off-by: Stefan Hajnoczi 
---
 tests/qemu-iotests/iotests.py | 5 +
 1 file changed, 5 insertions(+)

diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 6f057904a9..44477e9295 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -197,6 +197,11 @@ class VM(qtest.QEMUQtestMachine):
  socket_scm_helper=socket_scm_helper)
 self._num_drives = 0
 
+def add_object(self, opts):
+self._args.append('-object')
+self._args.append(opts)
+return self
+
 def add_device(self, opts):
 self._args.append('-device')
 self._args.append(opts)
-- 
2.14.3




[Qemu-block] [PATCH v2 0/6] block: avoid recursive AioContext acquire in bdrv_inactivate_all()

2017-12-07 Thread Stefan Hajnoczi
v2:
 * Added docs/devel/multiple-iothreads.txt doc update [Kevin]
 * Added qemu-iotests 203 test case [Kevin]
 * Added iothread_stop() race fix to make 203 reliable

Patch 1 is Paolo's recursive locking removal in bdrv_inactivate_all().  This
solves migration hangs and is the main point of the patch series.

Patches 2-6 add a qemu-iotests test case and update the multiple-iothreads.txt
documentation to discourage recursive AioContext locking.

Based-on: <20171206144550.22295-1-stefa...@redhat.com>

Paolo Bonzini (1):
  block: avoid recursive AioContext acquire in bdrv_inactivate_all()

Stefan Hajnoczi (5):
  docs: mark nested AioContext locking as a legacy API
  blockdev: add x-blockdev-set-iothread force boolean
  iotests: add VM.add_object()
  iothread: fix iothread_stop() race condition
  qemu-iotests: add 203 savevm with IOThreads test

 docs/devel/multiple-iothreads.txt |  7 +++--
 qapi/block-core.json  |  6 +++-
 include/sysemu/iothread.h |  3 +-
 block.c   | 14 --
 blockdev.c| 11 
 iothread.c| 20 +
 tests/qemu-iotests/203| 59 +++
 tests/qemu-iotests/203.out|  6 
 tests/qemu-iotests/group  |  1 +
 tests/qemu-iotests/iotests.py |  5 
 10 files changed, 114 insertions(+), 18 deletions(-)
 create mode 100755 tests/qemu-iotests/203
 create mode 100644 tests/qemu-iotests/203.out

-- 
2.14.3




[Qemu-block] [PATCH v2 1/6] block: avoid recursive AioContext acquire in bdrv_inactivate_all()

2017-12-07 Thread Stefan Hajnoczi
From: Paolo Bonzini 

BDRV_POLL_WHILE() does not support recursive AioContext locking.  It
only releases the AioContext lock once regardless of how many times the
caller has acquired it.  This results in a hang since the IOThread does
not make progress while the AioContext is still locked.

The following steps trigger the hang:

  $ qemu-system-x86_64 -M accel=kvm -m 1G -cpu host \
   -object iothread,id=iothread0 \
   -device virtio-scsi-pci,iothread=iothread0 \
   -drive if=none,id=drive0,file=test.img,format=raw \
   -device scsi-hd,drive=drive0 \
   -drive if=none,id=drive1,file=test.img,format=raw \
   -device scsi-hd,drive=drive1
  $ qemu-system-x86_64 ...same options... \
   -incoming tcp::1234
  (qemu) migrate tcp:127.0.0.1:1234
  ...hang...

Tested-by: Stefan Hajnoczi 
Signed-off-by: Paolo Bonzini 
---
 block.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/block.c b/block.c
index 9a1a0d1e73..1c37ce4554 100644
--- a/block.c
+++ b/block.c
@@ -4320,9 +4320,15 @@ int bdrv_inactivate_all(void)
 BdrvNextIterator it;
 int ret = 0;
 int pass;
+GSList *aio_ctxs = NULL, *ctx;
 
 for (bs = bdrv_first(); bs; bs = bdrv_next()) {
-aio_context_acquire(bdrv_get_aio_context(bs));
+AioContext *aio_context = bdrv_get_aio_context(bs);
+
+if (!g_slist_find(aio_ctxs, aio_context)) {
+aio_ctxs = g_slist_prepend(aio_ctxs, aio_context);
+aio_context_acquire(aio_context);
+}
 }
 
 /* We do two passes of inactivation. The first pass calls to drivers'
@@ -4340,9 +4346,11 @@ int bdrv_inactivate_all(void)
 }
 
 out:
-for (bs = bdrv_first(); bs; bs = bdrv_next()) {
-aio_context_release(bdrv_get_aio_context(bs));
+for (ctx = aio_ctxs; ctx != NULL; ctx = ctx->next) {
+AioContext *aio_context = ctx->data;
+aio_context_release(aio_context);
 }
+g_slist_free(aio_ctxs);
 
 return ret;
 }
-- 
2.14.3
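[Editorial aside] The essence of the patch is to deduplicate contexts before locking: collect the distinct AioContexts first, acquire each lock exactly once, and release each exactly once. A rough Python analogue (illustrative only; the names here are invented, and a non-recursive `threading.Lock` plays the role of the AioContext lock) shows the shape of the fix:

```python
import threading

def inactivate_all(devices):
    # devices maps a node name to its context lock; several nodes may
    # share one lock, like BlockDriverStates sharing an AioContext.
    acquired = []
    for ctx in devices.values():
        if ctx not in acquired:        # the g_slist_find() check above
            ctx.acquire()
            acquired.append(ctx)
    try:
        handled = sorted(devices)      # stand-in for the inactivation passes
    finally:
        for ctx in acquired:           # release each context exactly once
            ctx.release()
    return handled

shared = threading.Lock()              # non-recursive, like the fixed code assumes
assert inactivate_all({'drive0': shared, 'drive1': shared}) == ['drive0', 'drive1']
assert shared.acquire(blocking=False)  # the lock was fully released
shared.release()
```

With a non-recursive lock, the old per-device acquire would deadlock on the second device sharing a context, which mirrors the BDRV_POLL_WHILE() hang described above.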




[Qemu-block] [PATCH v2 2/6] docs: mark nested AioContext locking as a legacy API

2017-12-07 Thread Stefan Hajnoczi
See the patch for why nested AioContext locking is no longer allowed.

Signed-off-by: Stefan Hajnoczi 
---
 docs/devel/multiple-iothreads.txt | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/docs/devel/multiple-iothreads.txt b/docs/devel/multiple-iothreads.txt
index e4d340bbb7..4f9012d154 100644
--- a/docs/devel/multiple-iothreads.txt
+++ b/docs/devel/multiple-iothreads.txt
@@ -1,4 +1,4 @@
-Copyright (c) 2014 Red Hat Inc.
+Copyright (c) 2014-2017 Red Hat Inc.
 
 This work is licensed under the terms of the GNU GPL, version 2 or later.  See
 the COPYING file in the top-level directory.
@@ -92,8 +92,9 @@ aio_context_acquire()/aio_context_release() for mutual exclusion.  Once the
 context is acquired no other thread can access it or run event loop iterations
 in this AioContext.
 
-aio_context_acquire()/aio_context_release() calls may be nested.  This
-means you can call them if you're not sure whether #2 applies.
+Legacy code sometimes nests aio_context_acquire()/aio_context_release() calls.
+Do not use nesting anymore, it is incompatible with the BDRV_POLL_WHILE() macro
+used in the block layer and can lead to hangs.
 
 There is currently no lock ordering rule if a thread needs to acquire multiple
 AioContexts simultaneously.  Therefore, it is only safe for code holding the
-- 
2.14.3




Re: [Qemu-block] [Qemu-devel] [PATCH] iotests: Tweak 030 in order to trigger a race condition with parallel jobs

2017-12-07 Thread Alberto Garcia
On Thu 07 Dec 2017 08:16:41 PM CET, Eric Blake wrote:
>>  qemu_io('-f', iotests.imgfmt,
>> -'-c', 'write -P %d %d %d' % (i, i*1024*1024, num_kb * 1024),
>> +'-c', 'write -P 0xFF %dk %dk' % (i * 512, num_kb),
>
> I guess changing from a variable to a fixed 0xff pattern doesn't make
> a difference?

I noticed that with the previous code we would write zeroes to the first
image (i == 0), and with that I can't reproduce the bug. I assume that
block-stream doesn't copy the data in that case. Changing it to anything
!= 0 solves the problem.

And answering your question, it doesn't really matter if we write the
same value in all places, we only check the output of 'qemu-io -c map'.
Plus the areas don't even overlap.

Berto



Re: [Qemu-block] [Qemu-devel] [PATCH] iotests: Tweak 030 in order to trigger a race condition with parallel jobs

2017-12-07 Thread Eric Blake
On 12/07/2017 11:08 AM, Alberto Garcia wrote:
> This patch tweaks TestParallelOps in iotest 030 so it allocates data
> in smaller regions (256KB/512KB instead of 512KB/1MB) and the
> block-stream job in test_stream_commit() only needs to copy data that
> is at the very end of the image.
> 
> This way when the block-stream job is waken up it will finish right

s/waken up/awakened/

> away without any chance of being stopped by block_job_sleep_ns(). This
> triggers the bug that was fixed by 3d5d319e1221082974711af1d09d82f07
> and is therefore a more useful test case for parallel block jobs.
> 
> After this patch the aforementioned bug can also be reproduced with the
> test_stream_parallel() test case.
> 
> Signed-off-by: Alberto Garcia 
> ---
>  tests/qemu-iotests/030 | 12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 

>  # Put data into the images we are copying data from
>  for i in range(self.num_imgs / 2):
> -img_index = i * 2 + 1
> -# Alternate between 512k and 1M.
> +img_index = self.num_imgs - i * 2 - 2
> +# Alternate between 256KB and 512KB.
>  # This way jobs will not finish in the same order they were created
> -num_kb = 512 + 512 * (i % 2)
> +num_kb = 256 + 256 * (i % 2)
>  qemu_io('-f', iotests.imgfmt,
> -'-c', 'write -P %d %d %d' % (i, i*1024*1024, num_kb * 1024),
> +'-c', 'write -P 0xFF %dk %dk' % (i * 512, num_kb),

I guess changing from a variable to a fixed 0xff pattern doesn't make a
difference?

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: [Qemu-block] [Qemu-devel] [PATCH for-2.12 0/4] qmp dirty bitmap API

2017-12-07 Thread Kevin Wolf
Am 07.12.2017 um 12:56 hat Kashyap Chamarthy geschrieben:
> On Tue, Nov 21, 2017 at 07:10:19PM -0500, John Snow wrote:
> > On 11/21/2017 12:23 PM, Kevin Wolf wrote:
> 
> [...] # Snip lot of good discussion.
> 
> > > On another note, I would double check before adding a new block job type
> > > that this is the right way to go. We have recently noticed that many, if
> > > not all, of the existing block jobs (at least mirror, commit and backup)
> > > are so similar that they implement the same things multiple times and
> > > are just lacking different options and have different bugs. I'm
> > > seriously considering merging all of them into a single job type
> > > internally that just provides options that effectively turn it into one
> > > of the existing job types.
> > > 
> > 
> > I'm not particularly opposed. At the very, very least "backup" and
> > "mirror" are pretty much the same thing and "stream" and "commit" are
> > basically the same.
> > 
> > Forcing the backuppy-job and the consolidatey-job together seems like an
> > ever-so-slightly harder case to make, but I suppose the truth of the
> > matter in all cases is that we're copying data from one node to another...
> 
> So from the above interesting discussion, it seems like Kevin is leaning
> towards a single job type that offers 'stream', 'commit', 'backup', and
> 'mirror' functionality as part of a single command / job type.  Based on
> an instinct, this sounds a bit too stuffy and complex to me.
> 
> And John seems to be leaning towards two block device job types:
> 
>   - 'blockdev-foo' that offers both current 'stream' and 'commit'
> functionality as two different options to the same QMP command; and
> 
>   - 'blockdev-bar' will offer both 'mirror' and 'backup' functionality
> as part of the same QMP command
> 
> FWIW, this seems a bit more palatable, as it is unifying
> similar-functionality-that-differ-slightly into two distinct commands.
> 
> (/me is still wrapping his head around the bitmaps and incremental
> backups infrastructure.)

Commit of the active layer is _already_ a mirror job internally (and not
a stream job). It's pretty clear to me that commit and mirror are almost
the same, backup is pretty similar. Stream is somewhat different and
might make more sense as a separate job type.

Kevin



Re: [Qemu-block] [PATCH 1/2] qcow2: add overlap check for bitmap directory

2017-12-07 Thread John Snow


On 12/07/2017 04:43 AM, Vladimir Sementsov-Ogievskiy wrote:
> 07.12.2017 01:56, John Snow wrote:
>>
>> On 11/30/2017 11:47 AM, Vladimir Sementsov-Ogievskiy wrote:
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy 
>>> ---
>>>   block/qcow2.h  |  7 +--
>>>   block/qcow2-refcount.c | 12 
>>>   block/qcow2.c  |  6 ++
>>>   3 files changed, 23 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/block/qcow2.h b/block/qcow2.h
>>> index 6f0ff15dd0..8f226a3609 100644
>>> --- a/block/qcow2.h
>>> +++ b/block/qcow2.h
>>> @@ -98,6 +98,7 @@
>>>   #define QCOW2_OPT_OVERLAP_SNAPSHOT_TABLE "overlap-check.snapshot-table"
>>>   #define QCOW2_OPT_OVERLAP_INACTIVE_L1 "overlap-check.inactive-l1"
>>>   #define QCOW2_OPT_OVERLAP_INACTIVE_L2 "overlap-check.inactive-l2"
>>> +#define QCOW2_OPT_OVERLAP_BITMAP_DIRECTORY "overlap-check.bitmap-directory"
>>>   #define QCOW2_OPT_CACHE_SIZE "cache-size"
>>>   #define QCOW2_OPT_L2_CACHE_SIZE "l2-cache-size"
>>>   #define QCOW2_OPT_REFCOUNT_CACHE_SIZE "refcount-cache-size"
>>> @@ -406,8 +407,9 @@ typedef enum QCow2MetadataOverlap {
>>>   QCOW2_OL_SNAPSHOT_TABLE_BITNR = 5,
>>>   QCOW2_OL_INACTIVE_L1_BITNR    = 6,
>>>   QCOW2_OL_INACTIVE_L2_BITNR    = 7,
>>> +    QCOW2_OL_BITMAP_DIRECTORY_BITNR = 8,
>>>   -    QCOW2_OL_MAX_BITNR    = 8,
>>> +    QCOW2_OL_MAX_BITNR  = 9,
>>>     QCOW2_OL_NONE   = 0,
>>>   QCOW2_OL_MAIN_HEADER    = (1 << QCOW2_OL_MAIN_HEADER_BITNR),
>>> @@ -420,12 +422,13 @@ typedef enum QCow2MetadataOverlap {
>>>   /* NOTE: Checking overlaps with inactive L2 tables will result in bdrv
>>>    * reads. */
>>>   QCOW2_OL_INACTIVE_L2    = (1 << QCOW2_OL_INACTIVE_L2_BITNR),
>>> +    QCOW2_OL_BITMAP_DIRECTORY = (1 << QCOW2_OL_BITMAP_DIRECTORY_BITNR),
>>>   } QCow2MetadataOverlap;
>>>     /* Perform all overlap checks which can be done in constant time */
>>>   #define QCOW2_OL_CONSTANT \
>>>   (QCOW2_OL_MAIN_HEADER | QCOW2_OL_ACTIVE_L1 | QCOW2_OL_REFCOUNT_TABLE | \
>>> - QCOW2_OL_SNAPSHOT_TABLE)
>>> + QCOW2_OL_SNAPSHOT_TABLE | QCOW2_OL_BITMAP_DIRECTORY)
>>>     /* Perform all overlap checks which don't require disk access */
>>>   #define QCOW2_OL_CACHED \
>>> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
>>> index 3de1ab51ba..a7a2703f26 100644
>>> --- a/block/qcow2-refcount.c
>>> +++ b/block/qcow2-refcount.c
>>> @@ -2585,6 +2585,18 @@ int qcow2_check_metadata_overlap(BlockDriverState *bs, int ign, int64_t offset,
>>>   }
>>>   }
>>>   +    if ((chk & QCOW2_OL_BITMAP_DIRECTORY) &&
>>> +    (s->autoclear_features & QCOW2_AUTOCLEAR_BITMAPS))
>>> +    {
>>> +    /* update_ext_header_and_dir_in_place firstly drop autoclear flag,
>>> + * so it will not fail */
>>> +    if (overlaps_with(s->bitmap_directory_offset,
>>> +  s->bitmap_directory_size))
>>> +    {
>>> +    return QCOW2_OL_BITMAP_DIRECTORY;
>>> +    }
>>> +    }
>>> +
>> Isn't the purpose of this function to test if a given offset conflicts
>> with known regions of the file? I don't see you actually utilize the
>> 'offset' parameter here, but maybe I don't understand what you're trying
>> to accomplish.
> 
> #define overlaps_with(ofs, sz) \
>     ranges_overlap(offset, size, ofs, sz)
> 
> I've just copied one of similar blocks in qcow2_check_metadata_overlap()
> 

Sigh, I didn't realize that was a macro. I don't really like lowercase
macros, but you didn't add it.

Reviewed-by: John Snow 

Carry on...
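[Editorial aside] The overlaps_with() macro discussed above expands to a plain half-open interval intersection test. A Python equivalent (illustrative only, not the qcow2 source; `check_bitmap_directory` is an invented helper name) makes the new check easy to reason about:

```python
def ranges_overlap(offset1, size1, offset2, size2):
    # Two half-open ranges [offset, offset + size) intersect iff each
    # starts before the other ends.
    last1 = offset1 + size1
    last2 = offset2 + size2
    return offset1 < last2 and offset2 < last1

def check_bitmap_directory(offset, size, dir_offset, dir_size):
    # Mirrors the new QCOW2_OL_BITMAP_DIRECTORY branch: a conflict is
    # reported only when the write intersects the directory area.
    return ranges_overlap(offset, size, dir_offset, dir_size)

assert check_bitmap_directory(0x10000, 0x1000, 0x10800, 0x1000)      # overlaps
assert not check_bitmap_directory(0x10000, 0x1000, 0x11000, 0x1000)  # adjacent
```

Adjacent ranges (one ending exactly where the other begins) do not count as overlapping, which is the behavior the metadata overlap check relies on.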



[Qemu-block] [PATCH] iotests: Tweak 030 in order to trigger a race condition with parallel jobs

2017-12-07 Thread Alberto Garcia
This patch tweaks TestParallelOps in iotest 030 so it allocates data
in smaller regions (256KB/512KB instead of 512KB/1MB) and the
block-stream job in test_stream_commit() only needs to copy data that
is at the very end of the image.

This way when the block-stream job is waken up it will finish right
away without any chance of being stopped by block_job_sleep_ns(). This
triggers the bug that was fixed by 3d5d319e1221082974711af1d09d82f07
and is therefore a more useful test case for parallel block jobs.

After this patch the aforementioned bug can also be reproduced with the
test_stream_parallel() test case.

Signed-off-by: Alberto Garcia 
---
 tests/qemu-iotests/030 | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/tests/qemu-iotests/030 b/tests/qemu-iotests/030
index 457984b8e9..230eb761dd 100755
--- a/tests/qemu-iotests/030
+++ b/tests/qemu-iotests/030
@@ -156,7 +156,7 @@ class TestSingleDrive(iotests.QMPTestCase):
 class TestParallelOps(iotests.QMPTestCase):
 num_ops = 4 # Number of parallel block-stream operations
 num_imgs = num_ops * 2 + 1
-image_len = num_ops * 1024 * 1024
+image_len = num_ops * 512 * 1024
 imgs = []
 
 def setUp(self):
@@ -177,12 +177,12 @@ class TestParallelOps(iotests.QMPTestCase):
 
 # Put data into the images we are copying data from
 for i in range(self.num_imgs / 2):
-img_index = i * 2 + 1
-# Alternate between 512k and 1M.
+img_index = self.num_imgs - i * 2 - 2
+# Alternate between 256KB and 512KB.
 # This way jobs will not finish in the same order they were created
-num_kb = 512 + 512 * (i % 2)
+num_kb = 256 + 256 * (i % 2)
 qemu_io('-f', iotests.imgfmt,
-'-c', 'write -P %d %d %d' % (i, i*1024*1024, num_kb * 1024),
+'-c', 'write -P 0xFF %dk %dk' % (i * 512, num_kb),
 self.imgs[img_index])
 
 # Attach the drive to the VM
@@ -323,7 +323,7 @@ class TestParallelOps(iotests.QMPTestCase):
 self.assert_no_active_block_jobs()
 
 # Stream from node0 into node2
-result = self.vm.qmp('block-stream', device='node2', job_id='node2')
+result = self.vm.qmp('block-stream', device='node2', base_node='node0', job_id='node2')
 self.assert_qmp(result, 'return', {})
 
 # Commit from the active layer into node3
-- 
2.11.0
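[Editorial aside] The new allocation pattern in the setUp() hunk above can be modelled in a few lines (illustrative only) to confirm that each image gets a 256 KB or 512 KB region starting at i * 512 KB, so the regions neither overlap nor run past the shrunken image:

```python
num_ops = 4
num_imgs = num_ops * 2 + 1
image_len_kb = num_ops * 512

regions = []
for i in range(num_imgs // 2):
    img_index = num_imgs - i * 2 - 2
    num_kb = 256 + 256 * (i % 2)      # alternate between 256KB and 512KB
    start_kb = i * 512
    regions.append((img_index, start_kb, start_kb + num_kb))

# No region crosses into the next 512 KB slot or past the image end.
assert all(end <= start + 512 for _, start, end in regions)
assert all(end <= image_len_kb for _, start, end in regions)
assert regions == [(7, 0, 256), (5, 512, 1024), (3, 1024, 1280), (1, 1536, 2048)]
```

The reversed img_index also shows why only the last region sits at the very end of the image, which is what lets the block-stream job in test_stream_commit() finish immediately.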




Re: [Qemu-block] [Qemu-devel] [PATCH] qemu-img: Document --force-share / -U

2017-12-07 Thread Eric Blake
On 12/07/2017 04:58 AM, Kashyap Chamarthy wrote:
> On Thu, Dec 07, 2017 at 04:44:53PM +0800, Fam Zheng wrote:
>> Signed-off-by: Fam Zheng 
>> ---
>>  qemu-img.texi | 8 
>>  1 file changed, 8 insertions(+)
>>
>> diff --git a/qemu-img.texi b/qemu-img.texi
>> index fdcf120f36..3aa63aad55 100644
>> --- a/qemu-img.texi
>> +++ b/qemu-img.texi
>> @@ -57,6 +57,14 @@ exclusive with the @var{-O} parameters. It is currently required to also use
>>  the @var{-n} parameter to skip image creation. This restriction may be relaxed
>>  in a future release.
>>  
>> +@item --force-share (-U)
>> +
>> +If specified, qemu-img will open the image with shared permissions, which makes
> 
> Does 'texi' requires to quote tools like `qemu-img` (or single quotes),
> to highlight them in documentation?

Our usual markup appears to be:

@code{qemu-img}

> 
>> +it less likely to conflict with a running guest's permissions due to image
>> +locking. For example, this can be used to get the image information (with
>> +'info' subcommand) when the image is used by a running guest. Note that this
>> +could produce inconsistent result because of concurrent metadata changes, etc..
> 
> Super nit-pick:  an ellipsis[*] is three dots :-), so, when applying you
> might want to: s/../.../
> 
> [*] https://dictionary.cambridge.org/dictionary/english/ellipsis

Except that both "etc." and "..." independently convey a sense of
continuation, which means that using both at once is both redundant
(just one will do) and difficult to argue how to typeset (since 'etc.'
is often written with an explicit '.' to emphasize that is an
abbreviation, does that mean you have to write 'etc.''...' for a total
of 4 dots?).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





[Qemu-block] [PATCH v2 3/6] qapi: add nbd-server-remove

2017-12-07 Thread Vladimir Sementsov-Ogievskiy
Add a command for removing an export. It is needed for cases when we
do not want to keep an export after the operation on it has completed.
Another example is a temporary node created with blockdev-add: if we
want to delete it, we must first remove the corresponding NBD export.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 qapi/block.json | 18 ++
 include/block/nbd.h |  3 ++-
 blockdev-nbd.c  | 29 -
 nbd/server.c| 25 ++---
 4 files changed, 70 insertions(+), 5 deletions(-)

diff --git a/qapi/block.json b/qapi/block.json
index 503d4b287b..f83485c325 100644
--- a/qapi/block.json
+++ b/qapi/block.json
@@ -228,6 +228,24 @@
   'data': {'device': 'str', '*name': 'str', '*writable': 'bool'} }
 
 ##
+# @nbd-server-remove:
+#
+# Remove NBD export by name.
+#
+# @name: Export name.
+#
+# @force: Whether active connections to the export should be closed. If this
+# parameter is false, the export only becomes hidden from clients (new
+# connections are impossible) and will be freed automatically after
+# all clients are disconnected (default false).
+#
+# Returns: error if the server is not running or export is not found.
+#
+# Since: 2.12
+##
+{ 'command': 'nbd-server-remove', 'data': {'name': 'str', '*force': 'bool'} }
+
+##
 # @nbd-server-stop:
 #
 # Stop QEMU's embedded NBD server, and unregister all devices previously
diff --git a/include/block/nbd.h b/include/block/nbd.h
index 113c707a5e..b89d116597 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -261,12 +261,13 @@ NBDExport *nbd_export_new(BlockDriverState *bs, off_t 
dev_offset, off_t size,
   bool writethrough, BlockBackend *on_eject_blk,
   Error **errp);
 void nbd_export_close(NBDExport *exp);
+void nbd_export_hide(NBDExport *exp);
 void nbd_export_get(NBDExport *exp);
 void nbd_export_put(NBDExport *exp);
 
 BlockBackend *nbd_export_get_blockdev(NBDExport *exp);
 
-NBDExport *nbd_export_find(const char *name);
+NBDExport *nbd_export_find(const char *name, bool include_hidden);
 void nbd_export_set_name(NBDExport *exp, const char *name);
 void nbd_export_set_description(NBDExport *exp, const char *description);
 void nbd_export_close_all(void);
diff --git a/blockdev-nbd.c b/blockdev-nbd.c
index 46c885aa35..2495d38f2c 100644
--- a/blockdev-nbd.c
+++ b/blockdev-nbd.c
@@ -174,7 +174,7 @@ void qmp_nbd_server_add(const char *device, bool has_name, const char *name,
 name = device;
 }
 
-if (nbd_export_find(name)) {
+if (nbd_export_find(name, true)) {
 error_setg(errp, "NBD server already has export named '%s'", name);
 return;
 }
@@ -207,6 +207,33 @@ void qmp_nbd_server_add(const char *device, bool has_name, const char *name,
 nbd_export_put(exp);
 }
 
+void qmp_nbd_server_remove(const char *name, bool has_force, bool force,
+   Error **errp)
+{
+NBDExport *exp;
+
+if (!nbd_server) {
+error_setg(errp, "NBD server not running");
+return;
+}
+
+exp = nbd_export_find(name, true);
+if (exp == NULL) {
+error_setg(errp, "Export '%s' is not found", name);
+return;
+}
+
+if (!has_force) {
+force = false;
+}
+
+if (force) {
+nbd_export_close(exp);
+} else {
+nbd_export_hide(exp);
+}
+}
+
 void qmp_nbd_server_stop(Error **errp)
 {
 nbd_export_close_all();
diff --git a/nbd/server.c b/nbd/server.c
index e817c48087..d85f2dc747 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -78,6 +78,10 @@ struct NBDExport {
 
 BlockBackend *eject_notifier_blk;
 Notifier eject_notifier;
+
+/* A hidden export is not available for new connections and should be removed
+ * after the last client disconnects */
+bool hidden;
 };
 
 static QTAILQ_HEAD(, NBDExport) exports = QTAILQ_HEAD_INITIALIZER(exports);
@@ -300,7 +304,7 @@ static int nbd_negotiate_handle_export_name(NBDClient *client, uint32_t length,
 
 trace_nbd_negotiate_handle_export_name_request(name);
 
-client->exp = nbd_export_find(name);
+client->exp = nbd_export_find(name, false);
 if (!client->exp) {
 error_setg(errp, "export not found");
 return -EINVAL;
@@ -429,7 +433,7 @@ static int nbd_negotiate_handle_info(NBDClient *client, uint32_t length,
 }
 assert(length == 0);
 
-exp = nbd_export_find(name);
+exp = nbd_export_find(name, false);
 if (!exp) {
 return nbd_negotiate_send_rep_err(client->ioc, NBD_REP_ERR_UNKNOWN,
   opt, errp, "export '%s' not present",
@@ -966,6 +970,9 @@ void nbd_client_put(NBDClient *client)
 if (client->exp) {
 QTAILQ_REMOVE(&client->exp->clients, client, next);
 nbd_export_put(client->exp);
+if (client->exp->hidden && QTAILQ_EMPTY(&client->exp->clients)) {
+nbd_export_close(client->exp);
+
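[Editorial aside] The hide-versus-close semantics introduced in nbd/server.c above can be modelled with a toy registry (illustrative Python, not the NBD server; all class and method names here are invented): a hidden export is invisible to new clients but stays alive until its last client disconnects, while a forced close drops it immediately.

```python
class Export:
    def __init__(self, name):
        self.name = name
        self.hidden = False
        self.clients = set()
        self.closed = False

class Registry:
    def __init__(self):
        self.exports = {}

    def find(self, name, include_hidden=False):
        # Mirrors nbd_export_find(): hidden exports are skipped for
        # new client connections unless explicitly requested.
        exp = self.exports.get(name)
        if exp and exp.hidden and not include_hidden:
            return None
        return exp

    def remove(self, name, force=False):
        exp = self.exports[name]
        if force:
            exp.closed = True                # like nbd_export_close()
            del self.exports[name]
        else:
            exp.hidden = True                # like nbd_export_hide()

    def client_disconnect(self, exp, client):
        exp.clients.discard(client)
        if exp.hidden and not exp.clients:   # last client gone -> free it
            exp.closed = True
            del self.exports[exp.name]

reg = Registry()
exp = reg.exports['exp'] = Export('exp')
exp.clients.add('c1')
reg.remove('exp')                    # non-force: hide only
assert reg.find('exp') is None       # invisible to new clients
assert not exp.closed                # the existing client is still served
reg.client_disconnect(exp, 'c1')
assert exp.closed                    # freed after the last disconnect
```

This is exactly the behavior the @force documentation above describes: force=false hides, force=true closes active connections.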

[Qemu-block] [PATCH v2 4/6] iotest 147: add cases to test new @name parameter of nbd-server-add

2017-12-07 Thread Vladimir Sementsov-Ogievskiy
Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/147 | 68 +-
 tests/qemu-iotests/147.out |  4 +--
 2 files changed, 57 insertions(+), 15 deletions(-)

diff --git a/tests/qemu-iotests/147 b/tests/qemu-iotests/147
index 90f40ed245..d2081df84b 100755
--- a/tests/qemu-iotests/147
+++ b/tests/qemu-iotests/147
@@ -38,8 +38,8 @@ def flatten_sock_addr(crumpled_address):
 
 
 class NBDBlockdevAddBase(iotests.QMPTestCase):
-def blockdev_add_options(self, address, export=None):
-options = { 'node-name': 'nbd-blockdev',
+def blockdev_add_options(self, address, export, node_name):
+options = { 'node-name': node_name,
 'driver': 'raw',
 'file': {
 'driver': 'nbd',
@@ -50,23 +50,28 @@ class NBDBlockdevAddBase(iotests.QMPTestCase):
 options['file']['export'] = export
 return options
 
-def client_test(self, filename, address, export=None):
-bao = self.blockdev_add_options(address, export)
+def client_test(self, filename, address, export=None,
+node_name='nbd-blockdev', delete=True):
+bao = self.blockdev_add_options(address, export, node_name)
 result = self.vm.qmp('blockdev-add', **bao)
 self.assert_qmp(result, 'return', {})
 
+found = False
 result = self.vm.qmp('query-named-block-nodes')
 for node in result['return']:
-if node['node-name'] == 'nbd-blockdev':
+if node['node-name'] == node_name:
+found = True
 if isinstance(filename, str):
 self.assert_qmp(node, 'image/filename', filename)
 else:
 self.assert_json_filename_equal(node['image']['filename'],
 filename)
 break
+self.assertTrue(found)
 
-result = self.vm.qmp('blockdev-del', node_name='nbd-blockdev')
-self.assert_qmp(result, 'return', {})
+if delete:
+result = self.vm.qmp('blockdev-del', node_name=node_name)
+self.assert_qmp(result, 'return', {})
 
 
 class QemuNBD(NBDBlockdevAddBase):
@@ -125,26 +130,63 @@ class BuiltinNBD(NBDBlockdevAddBase):
 except OSError:
 pass
 
-def _server_up(self, address):
+def _server_up(self, address, export_name=None, export_name2=None):
 result = self.server.qmp('nbd-server-start', addr=address)
 self.assert_qmp(result, 'return', {})
 
-result = self.server.qmp('nbd-server-add', device='nbd-export')
+if export_name is None:
+result = self.server.qmp('nbd-server-add', device='nbd-export')
+else:
+result = self.server.qmp('nbd-server-add', device='nbd-export',
+ name=export_name)
 self.assert_qmp(result, 'return', {})
 
+if export_name2 is not None:
+result = self.server.qmp('nbd-server-add', device='nbd-export',
+ name=export_name2)
+self.assert_qmp(result, 'return', {})
+
+
 def _server_down(self):
 result = self.server.qmp('nbd-server-stop')
 self.assert_qmp(result, 'return', {})
 
-def test_inet(self):
+def do_test_inet(self, export_name=None):
 address = { 'type': 'inet',
 'data': {
 'host': 'localhost',
 'port': str(NBD_PORT)
 } }
-self._server_up(address)
-self.client_test('nbd://localhost:%i/nbd-export' % NBD_PORT,
- flatten_sock_addr(address), 'nbd-export')
+self._server_up(address, export_name)
+export_name = export_name or 'nbd-export'
+self.client_test('nbd://localhost:%i/%s' % (NBD_PORT, export_name),
+ flatten_sock_addr(address), export_name)
+self._server_down()
+
+def test_inet_default_export_name(self):
+self.do_test_inet()
+
+def test_inet_same_export_name(self):
+self.do_test_inet('nbd-export')
+
+def test_inet_different_export_name(self):
+self.do_test_inet('shadow')
+
+def test_inet_two_exports(self):
+address = { 'type': 'inet',
+'data': {
+'host': 'localhost',
+'port': str(NBD_PORT)
+} }
+self._server_up(address, 'exp1', 'exp2')
+self.client_test('nbd://localhost:%i/%s' % (NBD_PORT, 'exp1'),
+ flatten_sock_addr(address), 'exp1', 'node1', False)
+self.client_test('nbd://localhost:%i/%s' % (NBD_PORT, 'exp2'),
+ flatten_sock_addr(address), 'exp2', 'node2', False)
+result = self.vm.qmp('blockdev-del', node_name='node1')
+self.assert_qmp(result, 'return', {})
+  

[Qemu-block] [PATCH v2 6/6] iotest 201: new test for qmp nbd-server-remove

2017-12-07 Thread Vladimir Sementsov-Ogievskiy
Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 tests/qemu-iotests/201 | 130 +
 tests/qemu-iotests/201.out |   5 ++
 tests/qemu-iotests/group   |   1 +
 3 files changed, 136 insertions(+)
 create mode 100644 tests/qemu-iotests/201
 create mode 100644 tests/qemu-iotests/201.out

diff --git a/tests/qemu-iotests/201 b/tests/qemu-iotests/201
new file mode 100644
index 00..28d0324195
--- /dev/null
+++ b/tests/qemu-iotests/201
@@ -0,0 +1,130 @@
+#!/usr/bin/env python
+#
+# Tests for qmp command nbd-server-remove.
+#
+# Copyright (c) 2017 Virtuozzo International GmbH
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+#
+
+import os
+import sys
+import iotests
+import time
+from iotests import qemu_img, qemu_io, filter_qemu_io, QemuIoInteractive
+
+nbd_port = '10900'
+nbd_uri = 'nbd+tcp://localhost:' + nbd_port + '/exp'
+disk = os.path.join(iotests.test_dir, 'disk')
+
+class TestNbdServerRemove(iotests.QMPTestCase):
+def setUp(self):
+qemu_img('create', '-f', iotests.imgfmt, disk, '1M')
+
+self.vm = iotests.VM().add_drive(disk)
+self.vm.launch()
+
+address = {
+'type': 'inet',
+'data': {
+'host': 'localhost',
+'port': nbd_port
+}
+}
+
+result = self.vm.qmp('nbd-server-start', addr=address)
+self.assert_qmp(result, 'return', {})
+result = self.vm.qmp('nbd-server-add', device='drive0', name='exp')
+self.assert_qmp(result, 'return', {})
+
+def tearDown(self):
+self.vm.shutdown()
+os.remove(disk)
+
+def remove_export(self, name, force=None):
+if force is None:
+result = self.vm.qmp('nbd-server-remove', name=name)
+else:
+result = self.vm.qmp('nbd-server-remove', name=name, force=force)
+self.assert_qmp(result, 'return', {})
+
+def assertExportNotFound(self, name):
+result = self.vm.qmp('nbd-server-remove', name=name)
+self.assert_qmp(result, 'error/desc', "Export 'exp' is not found")
+
+def assertReadOk(self, qemu_io_output):
+self.assertEqual(filter_qemu_io(qemu_io_output).strip(),
+'read 512/512 bytes at offset 0\n' +
+'512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)')
+
+def assertReadFailed(self, qemu_io_output):
+self.assertEqual(filter_qemu_io(qemu_io_output).strip(),
+ 'read failed: Input/output error')
+
+def assertConnectFailed(self, qemu_io_output):
+self.assertEqual(filter_qemu_io(qemu_io_output).strip(),
+ "can't open device nbd+tcp://localhost:10900/exp: " +
+ "Requested export not available\nserver reported: " +
+ "export 'exp' not present")
+
+def do_test_connect_after_remove(self, force=None):
+args = ('-r', '-f', 'raw', '-c', 'read 0 512', nbd_uri)
+self.assertReadOk(qemu_io(*args))
+self.remove_export('exp', force)
+self.assertExportNotFound('exp')
+self.assertConnectFailed(qemu_io(*args))
+
+def test_connect_after_remove_default(self):
+self.do_test_connect_after_remove()
+
+def test_connect_after_remove_safe(self):
+self.do_test_connect_after_remove(False)
+
+def test_connect_after_remove_force(self):
+self.do_test_connect_after_remove(True)
+
+def do_test_remove_during_connect(self, force=None):
+qio = QemuIoInteractive('-r', '-f', 'raw', nbd_uri)
+self.assertReadOk(qio.cmd('read 0 512'))
+self.remove_export('exp', force)
+if force:
+self.assertReadFailed(qio.cmd('read 0 512'))
+self.assertExportNotFound('exp')
+else:
+self.assertReadOk(qio.cmd('read 0 512'))
+qio.close()
+self.assertExportNotFound('exp')
+
+def test_remove_during_connect_default(self):
+self.do_test_remove_during_connect()
+
+def test_remove_during_connect_safe(self):
+self.do_test_remove_during_connect(False)
+
+def test_remove_during_connect_force(self):
+self.do_test_remove_during_connect(True)
+
+def test_remove_during_connect_safe_force(self):
+qio = QemuIoInteractive('-r', '-f', 'raw', nbd_uri)
+

[Qemu-block] [PATCH v2 0/6] nbd export qmp interface

2017-12-07 Thread Vladimir Sementsov-Ogievskiy
Here:
1. separate export name from device name
1.1 several exports per device possible
2. add nbd-server-remove

v2:
01: add r-bs by Max and Eric, add comment to code (hope you don't mind)
03: address export by its name, not by device name
make it possible to force-remove an export which is already
non-force-removed (through the new "hidden" field)
other patches are new

v1:
Add command to remove nbd export, pair to nbd-server-add.
The whole thing and description are in patch 02.


Vladimir Sementsov-Ogievskiy (6):
  nbd/server: add additional assert to nbd_export_put
  qapi: add name parameter to nbd-server-add
  qapi: add nbd-server-remove
  iotest 147: add cases to test new @name parameter of nbd-server-add
  iotests: implement QemuIoInteractive class
  iotest 201: new test for qmp nbd-server-remove

 qapi/block.json   |  27 -
 include/block/nbd.h   |   3 +-
 blockdev-nbd.c|  41 +++--
 hmp.c |   5 +-
 nbd/server.c  |  31 +-
 tests/qemu-iotests/147|  68 +-
 tests/qemu-iotests/147.out|   4 +-
 tests/qemu-iotests/201| 130 ++
 tests/qemu-iotests/201.out|   5 ++
 tests/qemu-iotests/group  |   1 +
 tests/qemu-iotests/iotests.py |  38 
 11 files changed, 325 insertions(+), 28 deletions(-)
 create mode 100644 tests/qemu-iotests/201
 create mode 100644 tests/qemu-iotests/201.out

-- 
2.11.1




[Qemu-block] [PATCH v2 1/6] nbd/server: add additional assert to nbd_export_put

2017-12-07 Thread Vladimir Sementsov-Ogievskiy
This place is not obvious: nbd_export_close() may theoretically reduce the
refcount to 0. It may happen if someone calls nbd_export_put() on a named
export not through nbd_export_set_name() when the refcount is 1.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
---
 nbd/server.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/nbd/server.c b/nbd/server.c
index 92c0fdd03b..e817c48087 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -1190,6 +1190,12 @@ void nbd_export_put(NBDExport *exp)
         nbd_export_close(exp);
     }
 
+    /* nbd_export_close() may theoretically reduce refcount to 0. It may happen
+     * if someone calls nbd_export_put() on named export not through
+     * nbd_export_set_name() when refcount is 1. So, let's assert that
+     * it is > 0.
+     */
+    assert(exp->refcount > 0);
     if (--exp->refcount == 0) {
         assert(exp->name == NULL);
         assert(exp->description == NULL);
-- 
2.11.1
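A toy model (not QEMU code) of the refcount invariant the new assert protects may help: put() must never be reached with the refcount already at 0.

```python
class Export:
    """Minimal stand-in for NBDExport's reference counting."""
    def __init__(self):
        self.refcount = 1          # creator's reference
        self.closed = False

    def get(self):
        self.refcount += 1

    def put(self):
        assert self.refcount > 0   # the invariant this patch asserts
        self.refcount -= 1
        if self.refcount == 0:
            self.closed = True     # stands in for the final cleanup

exp = Export()
exp.get()      # e.g. the named-exports list takes a reference
exp.put()
exp.put()      # drops the last reference; cleanup runs exactly once
```

An extra put() after cleanup would trip the assertion instead of underflowing the count.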




[Qemu-block] [PATCH v2 2/6] qapi: add name parameter to nbd-server-add

2017-12-07 Thread Vladimir Sementsov-Ogievskiy
Allow the user to specify a name for the new export, so that the
internal node name is neither reused nor shown to clients.

This also allows creating several exports per device.

Signed-off-by: Vladimir Sementsov-Ogievskiy 
---
 qapi/block.json |  9 +++--
 blockdev-nbd.c  | 14 +-
 hmp.c   |  5 +++--
 3 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/qapi/block.json b/qapi/block.json
index f093fa3f27..503d4b287b 100644
--- a/qapi/block.json
+++ b/qapi/block.json
@@ -213,14 +213,19 @@
 #
 # @device: The device name or node name of the node to be exported
 #
+# @name: Export name. If unspecified, the @device parameter is used as the
+#        export name. (Since 2.12)
+#
 # @writable: Whether clients should be able to write to the device via the
 # NBD connection (default false).
 #
-# Returns: error if the device is already marked for export.
+# Returns: error if the server is not running, or an export with the same
+#          name already exists.
 #
 # Since: 1.3.0
 ##
-{ 'command': 'nbd-server-add', 'data': {'device': 'str', '*writable': 'bool'} }
+{ 'command': 'nbd-server-add',
+  'data': {'device': 'str', '*name': 'str', '*writable': 'bool'} }
 
 ##
 # @nbd-server-stop:
diff --git a/blockdev-nbd.c b/blockdev-nbd.c
index 28f551a7b0..46c885aa35 100644
--- a/blockdev-nbd.c
+++ b/blockdev-nbd.c
@@ -158,8 +158,8 @@ void qmp_nbd_server_start(SocketAddressLegacy *addr,
     qapi_free_SocketAddress(addr_flat);
 }
 
-void qmp_nbd_server_add(const char *device, bool has_writable, bool writable,
-                        Error **errp)
+void qmp_nbd_server_add(const char *device, bool has_name, const char *name,
+                        bool has_writable, bool writable, Error **errp)
 {
     BlockDriverState *bs = NULL;
     BlockBackend *on_eject_blk;
@@ -170,8 +170,12 @@ void qmp_nbd_server_add(const char *device, bool has_writable, bool writable,
         return;
     }
 
-    if (nbd_export_find(device)) {
-        error_setg(errp, "NBD server already exporting device '%s'", device);
+    if (!has_name) {
+        name = device;
+    }
+
+    if (nbd_export_find(name)) {
+        error_setg(errp, "NBD server already has export named '%s'", name);
         return;
     }
 
@@ -195,7 +199,7 @@ void qmp_nbd_server_add(const char *device, bool has_writable, bool writable,
         return;
     }
 
-    nbd_export_set_name(exp, device);
+    nbd_export_set_name(exp, name);
 
     /* The list of named exports has a strong reference to this export now and
      * our only way of accessing it is through nbd_export_find(), so we can drop
diff --git a/hmp.c b/hmp.c
index 35a7041824..0ea9c09b58 100644
--- a/hmp.c
+++ b/hmp.c
@@ -2203,7 +2203,8 @@ void hmp_nbd_server_start(Monitor *mon, const QDict *qdict)
             continue;
         }
 
-        qmp_nbd_server_add(info->value->device, true, writable, &local_err);
+        qmp_nbd_server_add(info->value->device, false, NULL,
+                           true, writable, &local_err);
 
         if (local_err != NULL) {
             qmp_nbd_server_stop(NULL);
@@ -2223,7 +2224,7 @@ void hmp_nbd_server_add(Monitor *mon, const QDict *qdict)
     bool writable = qdict_get_try_bool(qdict, "writable", false);
     Error *local_err = NULL;
 
-    qmp_nbd_server_add(device, true, writable, &local_err);
+    qmp_nbd_server_add(device, false, NULL, true, writable, &local_err);
 
     if (local_err != NULL) {
         hmp_handle_error(mon, &local_err);
-- 
2.11.1
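The control flow this patch adds to qmp_nbd_server_add() can be sketched in Python: default the export name to the device name, then refuse duplicates. This is a stand-in for the C code, with a plain dict modeling nbd_export_find().

```python
exports = {}   # export name -> device, models the named-exports list

def nbd_server_add(device, name=None):
    """Sketch of the patched qmp_nbd_server_add() logic."""
    if name is None:               # if (!has_name) { name = device; }
        name = device
    if name in exports:            # nbd_export_find(name)
        raise ValueError("NBD server already has export named '%s'" % name)
    exports[name] = device         # nbd_export_set_name(exp, name)

nbd_server_add('drive0')               # exported under its device name
nbd_server_add('drive0', name='snap')  # second export of the same device
```

Note the duplicate check moves from the device name to the export name, which is what makes several exports per device possible.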




Re: [Qemu-block] [PATCH] block: avoid recursive AioContext acquire in bdrv_inactivate_all()

2017-12-07 Thread Stefan Hajnoczi
On Wed, Dec 06, 2017 at 07:40:28PM +0100, Kevin Wolf wrote:
> Am 06.12.2017 um 18:54 hat Stefan Hajnoczi geschrieben:
> > From: Paolo Bonzini 
> > 
> > BDRV_POLL_WHILE() does not support recursive AioContext locking.  It
> > only releases the AioContext lock once regardless of how many times the
> > caller has acquired it.  This results in a hang since the IOThread does
> > not make progress while the AioContext is still locked.
> > 
> > The following steps trigger the hang:
> > 
> >   $ qemu-system-x86_64 -M accel=kvm -m 1G -cpu host \
> >-object iothread,id=iothread0 \
> >-device virtio-scsi-pci,iothread=iothread0 \
> >-drive if=none,id=drive0,file=test.img,format=raw \
> >-device scsi-hd,drive=drive0 \
> >-drive if=none,id=drive1,file=test.img,format=raw \
> >-device scsi-hd,drive=drive1
> >   $ qemu-system-x86_64 ...same options... \
> >-incoming tcp::1234
> >   (qemu) migrate tcp:127.0.0.1:1234
> >   ...hang...
> 
> Please turn this into a test case.
> 
> We should probably also update docs/devel/multiple-iothreads.txt.
> Currently it says:
> 
> aio_context_acquire()/aio_context_release() calls may be nested.
> This means you can call them if you're not sure whether #2 applies.
> 
> While technically that's still correct as far as the lock is concerned,
> the limitations of BDRV_POLL_WHILE() mean that in practice this is not a
> viable option any more at least in the context of the block layer.

Good point, will fix both things in v2.

Stefan


signature.asc
Description: PGP signature
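The failure mode described in the patch can be illustrated with a toy model (not QEMU code) of a recursively acquired context and a poll helper that, like BDRV_POLL_WHILE(), releases the lock only once regardless of acquisition depth.

```python
class AioContext:
    """Toy recursive lock: tracks acquisition depth only."""
    def __init__(self):
        self.depth = 0

    def acquire(self):
        self.depth += 1

    def release(self):
        assert self.depth > 0
        self.depth -= 1

def bdrv_poll_while(ctx):
    """Models BDRV_POLL_WHILE: drop the lock once, poll, retake it.
    Returns whether the IOThread could have made progress."""
    ctx.release()                      # one release, regardless of depth
    iothread_can_run = (ctx.depth == 0)
    ctx.acquire()
    return iothread_can_run

ctx = AioContext()
ctx.acquire()
ctx.acquire()                          # nested acquire, as in the hang
```

With depth 2, the single release leaves the context held, so the modeled IOThread never runs: the hang.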


[Qemu-block] QEMU not honouring bootorder

2017-12-07 Thread Michal Privoznik
Dear list,

I've encountered the following problem. I have two disks:

  /var/lib/libvirt/images/fedora.qcow2  (which contains OS)
  /dev/sde (iSCSI dummy disk just for testing)

Now, when I configure QEMU to start with both of them, QEMU/SeaBIOS
tries to boot from /dev/sde, which obviously fails. Even setting
bootorder does not help. Here's my command line:

  qemu-system-x86_64 \
  -boot menu=on,strict=on \
  -device lsi,id=scsi0,bus=pci.0 \
  -drive file=/var/lib/libvirt/images/fedora.qcow2,format=qcow2,if=none,id=drive-scsi0 \
  -device scsi-hd,bus=scsi0.0,drive=drive-scsi0,bootindex=1 \
  -drive file=/dev/sde,format=raw,if=none,id=drive-scsi1 \
  -device scsi-block,bus=scsi0.0,drive=drive-scsi1,bootindex=2

It was found that if 'drive-scsi1' is scsi-hd instead of scsi-block
everything works as expected and I can boot my guest successfully.

Michal



Re: [Qemu-block] [Qemu-devel] [PATCH for-2.12 0/4] qmp dirty bitmap API

2017-12-07 Thread Kashyap Chamarthy
On Tue, Nov 21, 2017 at 07:10:19PM -0500, John Snow wrote:
> On 11/21/2017 12:23 PM, Kevin Wolf wrote:

[...] # Snip lot of good discussion.

> > On another note, I would double check before adding a new block job type
> > that this is the right way to go. We have recently noticed that many, if
> > not all, of the existing block jobs (at least mirror, commit and backup)
> > are so similar that they implement the same things multiple times and
> > are just lacking different options and have different bugs. I'm
> > seriously considering merging all of them into a single job type
> > internally that just provides options that effectively turn it into one
> > of the existing job types.
> > 
> 
> I'm not particularly opposed. At the very, very least "backup" and
> "mirror" are pretty much the same thing and "stream" and "commit" are
> basically the same.
> 
> Forcing the backuppy-job and the consolidatey-job together seems like an
> ever-so-slightly harder case to make, but I suppose the truth of the
> matter in all cases is that we're copying data from one node to another...

So from the above interesting discussion, it seems like Kevin is leaning
towards a single job type that offers 'stream', 'commit', 'backup', and
'mirror' functionality as part of a single command / job type.  On
instinct, this sounds a bit too stuffy and complex to me.

And John seems to be leaning towards two block device job types:

  - 'blockdev-foo' that offers both current 'stream' and 'commit'
functionality as two different options to the same QMP command; and

  - 'blockdev-bar' will offer both 'mirror' and 'backup' functionality
as part of the same QMP command

FWIW, this seems a bit more palatable, as it is unifying
similar-functionality-that-differs-slightly into two distinct commands.

(/me is still wrapping his head around the bitmaps and incremental
backups infrastructure.)

> > So even if we want to tie the bitmap management to a block job, we
> > should consider adding it as an option to the existing backup job rather
> > than adding a completely new fourth job type that again does almost the
> > same except for some bitmap management stuff on completion.
> > 
> 
> ...except here, where fleecing does not necessarily copy data in the
> same way.
> 
> (It probably could re-use the copy-on-write notifiers that will be
> replaced by filter nodes, but I don't see it reusing much else.)
> 
> We could try it before I naysay it, but where fleecing is concerned
> we're not using QEMU to move any bits around. It does feel pretty
> categorically different from the first four jobs.
> 
> I wouldn't want to see the franken-job be drowned in conditional
> branches for 5,000 options, either. Eliminating some redundancy is good,
> but asserting that all existing jobs (and this possible new one too)
> should all be the same makes me worry that the resulting code would be
> too complex to work with.

[...]

-- 
/kashyap



Re: [Qemu-block] [PATCH] qemu-img: Document --force-share / -U

2017-12-07 Thread Kashyap Chamarthy
On Thu, Dec 07, 2017 at 04:44:53PM +0800, Fam Zheng wrote:
> Signed-off-by: Fam Zheng 
> ---
>  qemu-img.texi | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/qemu-img.texi b/qemu-img.texi
> index fdcf120f36..3aa63aad55 100644
> --- a/qemu-img.texi
> +++ b/qemu-img.texi
> @@ -57,6 +57,14 @@ exclusive with the @var{-O} parameters. It is currently required to also use
>  the @var{-n} parameter to skip image creation. This restriction may be relaxed
>  in a future release.
>  
> +@item --force-share (-U)
> +
> +If specified, qemu-img will open the image with shared permissions, which makes

Does 'texi' require quoting tools like `qemu-img` (or using single quotes)
to highlight them in the documentation?

> +it less likely to conflict with a running guest's permissions due to image
> +locking. For example, this can be used to get the image information (with
> +'info' subcommand) when the image is used by a running guest. Note that this
> +could produce inconsistent result because of concurrent metadata changes, etc..

Super nit-pick:  an ellipsis[*] is three dots :-), so, when applying you
might want to: s/../.../

[*] https://dictionary.cambridge.org/dictionary/english/ellipsis

Regardless:

Reviewed-by: Kashyap Chamarthy 

[...]

-- 
/kashyap



Re: [Qemu-block] [Qemu-devel] [PATCH for-2.12 0/4] qmp dirty bitmap API

2017-12-07 Thread Vladimir Sementsov-Ogievskiy

07.12.2017 03:38, John Snow wrote:


On 11/30/2017 07:10 AM, Vladimir Sementsov-Ogievskiy wrote:

18.11.2017 00:35, John Snow wrote:

On 11/17/2017 03:22 AM, Vladimir Sementsov-Ogievskiy wrote:

17.11.2017 06:10, John Snow wrote:

On 11/16/2017 03:17 AM, Vladimir Sementsov-Ogievskiy wrote:

16.11.2017 00:20, John Snow wrote:

On 11/13/2017 11:20 AM, Vladimir Sementsov-Ogievskiy wrote:

Hi all.

There are three qmp commands, needed to implement external backup
API.

Using these three commands, client may do all needed bitmap
management by
hand:

on backup start we need to do a transaction:
    {disable old bitmap, create new bitmap}

on backup success:
    drop old bitmap

on backup fail:
    enable old bitmap
    merge new bitmap to old bitmap
    drop new bitmap
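The three-command workflow above can be modeled with sets of dirty block numbers standing in for real dirty bitmaps (illustrative only; the function names are hypothetical, not QMP commands).

```python
def start_backup(bitmaps, old):
    """Transaction at backup start: disable old bitmap, create new one."""
    bitmaps[old]['enabled'] = False
    bitmaps['new'] = {'dirty': set(), 'enabled': True}

def backup_success(bitmaps, old):
    del bitmaps[old]               # drop old bitmap; 'new' tracks future writes

def backup_fail(bitmaps, old):
    bitmaps[old]['enabled'] = True                     # enable old bitmap
    bitmaps[old]['dirty'] |= bitmaps['new']['dirty']   # merge new into old
    del bitmaps['new']                                 # drop new bitmap

bitmaps = {'old': {'dirty': {1, 2}, 'enabled': True}}
start_backup(bitmaps, 'old')
bitmaps['new']['dirty'].add(3)     # guest writes during the failed backup
backup_fail(bitmaps, 'old')
```

The point of the merge on failure is that writes made while the backup ran are folded back into the old bitmap, so no dirty block is lost.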


Can you give me an example of how you expect these commands to be used,
and why they're required?

I'm a little wary about how error-prone these commands might be and the
potential for incorrect usage seems... high. Why do we require them?

It is needed for incremental backup. It looks like a bad idea to export
the abdicate/reclaim functionality; it is simpler and clearer to allow
the user to merge/enable/disable bitmaps by hand.

usage is like this:

1. we have dirty bitmap bitmap0 for incremental backup.

2. prepare image fleecing (create temporary image with
backing=our_disk)
3. in qmp transaction:
  - disable bitmap0
  - create bitmap1
  - start image fleecing (backup sync=none our_disk -> temp_disk)

This could probably just be its own command, though:

block-job-fleece node=foobar bitmap=bitmap0 etc=etera etc=etera

Could handle forking the bitmap. I'm not sure what the arguments would
look like, but we could name the NBD export here, too. (Assuming the
server is already started and we just need to create the share.)

Then, we can basically do what mirror does:

(1) Cancel
(2) Complete

Cancel would instruct QEMU to keep the bitmap changes (i.e. roll back),
and Complete would instruct QEMU to discard the changes.

This way we don't need to expose commands like split or merge that will
almost always be dangerous to use over QMP.

In fact, a fleecing job would be really convenient even without a
bitmap, because it'd still be nice to have a convenience command for
it.
Using an existing infrastructure and understood paradigm is just a
bonus.

1. If I understand correctly, Kevin and Max described in their report in
Prague a new block-job approach using filter nodes, so I'm not sure it
is a good idea to introduce a new old-style block job now, where we can
do without it.


We could do without it, but it might be a lot better to have everything
wrapped up in a command that's easy to digest instead of releasing 10
smaller commands that have to be executed in a very specific way in
order to work correctly.

I'm thinking about the complexity of error checking here with all the
smaller commands, versus error checking on a larger workflow we
understand and can quality test better.

I'm not sure that filter nodes becoming the new normal for block jobs
precludes our ability to use the job-management API as a handle for
managing the lifetime of a long-running task like fleecing, but I'll
check with Max and Kevin about this.


2. there is the following scenario: customers need the ability to
create a backup of data changed since some
point in time. So, maintaining several static and one (or several) active
bitmaps, with the possibility of merging some of them
and creating a backup using the merged bitmap, may be convenient.


I think the ability to copy bitmaps and issue differential backups would
be sufficient in all cases I could think of...

so, instead of keeping several static bitmaps with the ability to merge them,
you propose keeping several active bitmaps and copying them to make a backup?

so, instead of a new qmp command for merge, add a new qmp command for copy?

in the case of static bitmaps, we can implement saving/loading them to the
image to free RAM space, so that approach is better.

or what do you propose for [2]?





I'm sorry, I don't think I understand.

"customers need the ability to create a backup of data changed since
some point in time."

Is that not the existing case for a simple incremental backup? Granted,
the point in time was decided when we created the bitmap or when we made
the last backup, but it is "since some point in time."

If you mean to say an arbitrary point in time after-the-fact, I don't
see how the API presented here helps enable that functionality.

(By "arbitrary point in time after-the-fact" I mean, for example: say a
user installs a malicious application in a VM on Thursday, but the
bitmap was created on Monday. The user wants to go back to Wednesday
evening, but we have no record of that point in time, so we cannot go
back to it.)

Can you elaborate on what you're trying to accomplish so I make sure I'm
considering your use case carefully?


Yes, point in time is when we create a dirty bitmap. But we want to 
maintain several 

[Qemu-block] [PATCH] qemu-img: Document --force-share / -U

2017-12-07 Thread Fam Zheng
Signed-off-by: Fam Zheng 
---
 qemu-img.texi | 8 
 1 file changed, 8 insertions(+)

diff --git a/qemu-img.texi b/qemu-img.texi
index fdcf120f36..3aa63aad55 100644
--- a/qemu-img.texi
+++ b/qemu-img.texi
@@ -57,6 +57,14 @@ exclusive with the @var{-O} parameters. It is currently required to also use
 the @var{-n} parameter to skip image creation. This restriction may be relaxed
 in a future release.
 
+@item --force-share (-U)
+
+If specified, qemu-img will open the image with shared permissions, which makes
+it less likely to conflict with a running guest's permissions due to image
+locking. For example, this can be used to get the image information (with
+'info' subcommand) when the image is used by a running guest. Note that this
+could produce inconsistent result because of concurrent metadata changes, etc..
+
 @item fmt
 is the disk image format. It is guessed automatically in most cases. See below
 for a description of the supported disk formats.
-- 
2.14.3