Re: [Qemu-block] v2v: -o rhv-upload: Long time spent zeroing the disk

2018-04-12 Thread Richard W.M. Jones
On Thu, Apr 12, 2018 at 03:44:26PM +, Nir Soffer wrote:
> On Thu, Apr 12, 2018 at 5:42 PM Eric Blake  wrote:
> 
> > On 04/12/2018 05:24 AM, Richard W.M. Jones wrote:
> >
> > > I don't think we have nbd-server in RHEL, and in any case wouldn't it
> > > be better to use qemu-nbd?
> > >
> > > You just start a new qemu-nbd process instead of faffing around with
> > > configuration files, kill the qemu-nbd process when you're done, and
> 
> > qemu-nbd supports qcow2 already.
> 
> 
> > That, and qemu-nbd supports extensions such as NBD_CMD_BLOCK_STATUS and
> > NBD_OPT_STRUCTURED_REPLY that nbd-server has not implemented yet; a qemu
> > NBD client talking to a qemu-nbd server is thus going to be able to take
> > advantage of those extensions for better performance that would not be
> > possible with a qemu NBD client talking to an nbd-server instance (at
> > least, not without someone implementing the new features there).  And
> > this is no different from the situation where nbdkit as the server lacks
> > several features; the current rhv-upload patches use a python plugin to
> > nbdkit, which is implemented as serializing all requests; while using
> > qemu-nbd as the server would allow parallel requests to be in flight
> > simultaneously.
> >
> 
> Right, qemu-nbd will be better.
> 
> The manual is not very useful - do we have examples somewhere?

For another project I'm currently running qemu-nbd like this to serve
out the NBD root disk of a small development board that lacks SATA:

  qemu-nbd -t -f raw -x / /dev/VG/root &

That will listen on TCP port 10809, which is the standard
NBD port (but you can change the port with the -p option).

For testing you can connect to the server using qemu-io (low-level
reads and writes of blocks) or using libguestfs, e.g.:

  guestfish --format=raw -a nbd://server:10809/
  > run
  > list-filesystems
  > mount /dev/sda1 /
  > ll /
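
For qemu-io, something along these lines should also work (a hedged
sketch; the URL mirrors the guestfish example above):

  qemu-io -f raw -c 'read -v 0 512' nbd://server:10809/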

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v



Re: [Qemu-block] v2v: -o rhv-upload: Long time spent zeroing the disk

2018-04-12 Thread Nir Soffer
On Thu, Apr 12, 2018 at 5:42 PM Eric Blake  wrote:

> On 04/12/2018 05:24 AM, Richard W.M. Jones wrote:
>
> > I don't think we have nbd-server in RHEL, and in any case wouldn't it
> > be better to use qemu-nbd?
> >
> > You just start a new qemu-nbd process instead of faffing around with
> > configuration files, kill the qemu-nbd process when you're done, and

> qemu-nbd supports qcow2 already.


> That, and qemu-nbd supports extensions such as NBD_CMD_BLOCK_STATUS and
> NBD_OPT_STRUCTURED_REPLY that nbd-server has not implemented yet; a qemu
> NBD client talking to a qemu-nbd server is thus going to be able to take
> advantage of those extensions for better performance that would not be
> possible with a qemu NBD client talking to an nbd-server instance (at
> least, not without someone implementing the new features there).  And
> this is no different from the situation where nbdkit as the server lacks
> several features; the current rhv-upload patches use a python plugin to
> nbdkit, which is implemented as serializing all requests; while using
> qemu-nbd as the server would allow parallel requests to be in flight
> simultaneously.
>

Right, qemu-nbd will be better.

The manual is not very useful - do we have examples somewhere?

We will consider this for 4.3.

Nir


Re: [Qemu-block] v2v: -o rhv-upload: Long time spent zeroing the disk

2018-04-12 Thread Eric Blake
On 04/12/2018 05:24 AM, Richard W.M. Jones wrote:

> I don't think we have nbd-server in RHEL, and in any case wouldn't it
> be better to use qemu-nbd?
> 
> You just start a new qemu-nbd process instead of faffing around with
> configuration files, kill the qemu-nbd process when you're done, and
> qemu-nbd supports qcow2 already.

That, and qemu-nbd supports extensions such as NBD_CMD_BLOCK_STATUS and
NBD_OPT_STRUCTURED_REPLY that nbd-server has not implemented yet; a qemu
NBD client talking to a qemu-nbd server is thus going to be able to take
advantage of those extensions for better performance that would not be
possible with a qemu NBD client talking to an nbd-server instance (at
least, not without someone implementing the new features there).  And
this is no different from the situation where nbdkit as the server lacks
several features; the current rhv-upload patches use a python plugin to
nbdkit, which is implemented as serializing all requests; while using
qemu-nbd as the server would allow parallel requests to be in flight
simultaneously.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org





Re: [Qemu-block] v2v: -o rhv-upload: Long time spent zeroing the disk

2018-04-12 Thread Richard W.M. Jones
On Thu, Apr 12, 2018 at 09:22:16AM +, Nir Soffer wrote:
> I think we can expose NBD using nbd-server and dynamic exports.
> It can work like this:
> 
> 0. Install nbd and enable nbd-server on a host, running
>as vdsm:kvm, not exporting anything.
> 
> 1. User starts transfer session via oVirt API with protocol="nbd"
> (if not specified, use "https" for backward compatibility)
> 
> 2. oVirt selects a host for the transfer and configures an export
>in that host - for example:
> 
> [bbff5ab7-2e8e-45b8-a497-423c2d2b3ff1]
> exportname = /path/to/image
> trim = yes
> timeout = 300
> 
> nbd-server supports reloading configuration via SIGHUP.
> Looks like it should work for adding exports dynamically:
> https://github.com/NetworkBlockDevice/nbd/blob/cb6e816c4d2c536cfff5ba9859f210aabe39539b/nbd-server.c#L3007

I don't think we have nbd-server in RHEL, and in any case wouldn't it
be better to use qemu-nbd?

You just start a new qemu-nbd process instead of faffing around with
configuration files, kill the qemu-nbd process when you're done, and
qemu-nbd supports qcow2 already.

It doesn't support progress, but it's not very clear what "progress"
means for an NBD connection, since it's quite reasonable to go back
and rewrite blocks, or (as with qemu-img convert) issue zeroes, or
trims, etc.  It might be better simply to derive progress from ‘du -s
target_file’, ‘qemu-img info target.qcow2’ or similar.
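
For example, a crude progress monitor could just poll the allocation of
the destination (a hedged sketch; the path is hypothetical):

  # re-run every 10 seconds; swap in 'qemu-img info' to watch qcow2 metadata
  watch -n 10 du -sh /path/to/target.qcow2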

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top



Re: [Qemu-block] v2v: -o rhv-upload: Long time spent zeroing the disk

2018-04-12 Thread Nir Soffer
On Thu, Apr 12, 2018 at 2:07 AM Nir Soffer  wrote:

> On Tue, Apr 10, 2018 at 6:53 PM Richard W.M. Jones 
> wrote:
>
...

> > Dan Berrange pointed out earlier on that it might be easier if imageio
> > just exposed NBD, or if we found a way to tunnel NBD requests over web
> > sockets (in the former case nbdkit would not be needed, in the latter
> > case nbdkit could act as a bridge).

I think we can expose NBD using nbd-server and dynamic exports.
It can work like this:

0. Install nbd and enable nbd-server on a host, running
   as vdsm:kvm, not exporting anything.

1. User starts transfer session via oVirt API with protocol="nbd"
(if not specified, use "https" for backward compatibility)

2. oVirt selects a host for the transfer and configures an export
   in that host - for example:

[bbff5ab7-2e8e-45b8-a497-423c2d2b3ff1]
exportname = /path/to/image
trim = yes
timeout = 300

nbd-server supports reloading configuration via SIGHUP.
Looks like it should work for adding exports dynamically:
https://github.com/NetworkBlockDevice/nbd/blob/cb6e816c4d2c536cfff5ba9859f210aabe39539b/nbd-server.c#L3007

I could not find code handling removing exports.

3. User gets back
- transfer_url: "nbd:hostname:port"
- export_name: bbff5ab7-2e8e-45b8-a497-423c2d2b3ff1

Assuming that if NBD_OPT_EXPORT_NAME is bbff5ab7-2e8e-45b8-a497-423c2d2b3ff1,
nbd-server will export the /path/to/image mentioned in the config's
exportname.

This seems to be the case based on:
- https://github.com/NetworkBlockDevice/nbd/blob/master/nbd-server.c#L998
-
https://github.com/NetworkBlockDevice/nbd/blob/cb6e816c4d2c536cfff5ba9859f210aabe39539b/nbd-server.c#L2074

4. User transmits data using nbd-client or qemu-img, or maybe
   a wrapper script to make it easier (see the example after this list)

5. During the transfer, oVirt monitors progress and activity.

I don't see how we can do this with current nbd-server.

6. User ends the transfer session via oVirt API

7. oVirt removes the export and finalizes the transfer.
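
A hedged sketch of what steps 2 and 4 could look like from the command
line (host, port and file names are the placeholder values above; the
exact invocations are assumptions on my part, not a tested recipe):

  # step 2, on the host: reload nbd-server after appending the new
  # [bbff5ab7-2e8e-45b8-a497-423c2d2b3ff1] section to its config
  kill -HUP "$(pidof nbd-server)"

  # step 4, on the client: upload a local qcow2 image into the raw
  # export, converting on the fly (-n skips target creation, -p shows
  # qemu-img's own progress)
  qemu-img convert -p -n -f qcow2 -O raw disk.qcow2 \
      'nbd://hostname:port/bbff5ab7-2e8e-45b8-a497-423c2d2b3ff1'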

I guess this can work in a similar way for KubeVirt. The way
the image is provisioned and attached to a host will be
different, but the way a user will transfer image data can be
the same.

Seems that nbd-server is missing:
- way to remove exports dynamically
- way to monitor export activity and progress

Nir


Re: [Qemu-block] v2v: -o rhv-upload: Long time spent zeroing the disk

2018-04-11 Thread Nir Soffer
On Tue, Apr 10, 2018 at 6:53 PM Richard W.M. Jones 
wrote:

> On Tue, Apr 10, 2018 at 03:25:47PM +, Nir Soffer wrote:
> > On Tue, Apr 10, 2018 at 5:50 PM Richard W.M. Jones 
> > wrote:
> >
> > > On Tue, Apr 10, 2018 at 02:07:33PM +, Nir Soffer wrote:
> > > > This makes sense if the device is backed by a block device on the
> > > > oVirt side, and the NBD server supports efficient zeroing. But in this
> > > > case the device is backed by an empty sparse file on NFS, and oVirt
> > > > does not yet support efficient zeroing; we just write zeros manually.
> > > >
> > > > I think this should be handled on the virt-v2v plugin side. When
> > > > zeroing a raw file image, you can ignore zero requests after the
> > > > highest write offset, since the plugin created a new image, and we
> > > > know that the image is empty.
> > > >
> > > > When the destination is a block device we cannot avoid zeroing, since
> > > > a block device may contain junk data (we usually get dirty empty
> > > > images from our local xtremio server).
> > >
> > > (Off topic for qemu-block but ...)  We don't have enough information
> > > at our end to know about any of this.
> > >
> >
> > Can't we use this logic in the oVirt plugin?
> >
> > file based storage -> skip initial zeroing
> > block based storage -> use initial zeroing
> >
> > Do you think that publishing disk capabilities in the sdk will solve
> this?
>
> The plugin would have to do some complicated gymnastics.  It would
> have to keep track of which areas of the disk have been written and
> ignore NBD_CMD_WRITE_ZEROES for other areas, except if block-based
> storage is being used.  And so yes we'd also need the imageio API to
> publish that information to the plugin.
>
> So it's possible but not trivial.
>

I think this should be fixed in qemu-img. The current zero optimization may
yield an insignificant improvement when the backend supports fast zeroing
(e.g. fallocate, blkzeroout) by minimizing the number of syscalls, but it has
a severe performance issue when the destination does not support fast zeroing
(e.g. NFS < 4.2).

We plan to improve zero performance in 4.2.z, but this will not solve the
issue on NFS < 4.2.

By the way I think we're slowly reimplementing NBD in the imageio API.
>

You can also look at it as - slowly showing that HTTP can replace
NBD :-)


> Dan Berrange pointed out earlier on that it might be easier if imageio
> just exposed NBD, or if we found a way to tunnel NBD requests over web
> sockets (in the former case nbdkit would not be needed, in the latter
> case nbdkit could act as a bridge).
>

I'm not sure the more complex web sockets are needed for moving images
around. Web sockets are good when the client is a browser and you need
to pass a lot of small messages. If you have a real socket you don't need
web sockets. And the overhead of HTTP headers is insignificant when you
move a lot of data around.

Exposing NBD using the same authentication mechanism can be interesting,
but only if clients have an easy way to use this.

Will this allow a client (e.g. a backup vendor) to download and upload images
using qemu-img?

qemu-img <-nbd-> ovirt-imageio <-nbd-> qemu/nbd-server

We also need streaming support - with HTTP you can stream the data from
anywhere, while qemu-img needs a file. I don't think that backup vendors
or users will be happy to implement nbdkit plugins.

If qemu-img supported input and output from stdin/stdout instead of
files, this model would sound much more interesting.

Nir


Re: [Qemu-block] v2v: -o rhv-upload: Long time spent zeroing the disk

2018-04-10 Thread Richard W.M. Jones
On Tue, Apr 10, 2018 at 03:25:47PM +, Nir Soffer wrote:
> On Tue, Apr 10, 2018 at 5:50 PM Richard W.M. Jones 
> wrote:
> 
> > On Tue, Apr 10, 2018 at 02:07:33PM +, Nir Soffer wrote:
> > > This makes sense if the device is backed by a block device on the oVirt
> > > side, and the NBD server supports efficient zeroing. But in this case
> > > the device is backed by an empty sparse file on NFS, and oVirt does not
> > > yet support efficient zeroing; we just write zeros manually.
> > >
> > > I think this should be handled on the virt-v2v plugin side. When zeroing
> > > a raw file image, you can ignore zero requests after the highest write
> > > offset, since the plugin created a new image, and we know that the image
> > > is empty.
> > >
> > > When the destination is a block device we cannot avoid zeroing, since a
> > > block device may contain junk data (we usually get dirty empty images
> > > from our local xtremio server).
> >
> > (Off topic for qemu-block but ...)  We don't have enough information
> > at our end to know about any of this.
> >
> 
> Can't we use this logic in the oVirt plugin?
> 
> file based storage -> skip initial zeroing
> block based storage -> use initial zeroing
>
> Do you think that publishing disk capabilities in the sdk will solve this?

The plugin would have to do some complicated gymnastics.  It would
have to keep track of which areas of the disk have been written and
ignore NBD_CMD_WRITE_ZEROES for other areas, except if block-based
storage is being used.  And so yes we'd also need the imageio API to
publish that information to the plugin.

So it's possible but not trivial.

By the way I think we're slowly reimplementing NBD in the imageio API.
Dan Berrange pointed out earlier on that it might be easier if imageio
just exposed NBD, or if we found a way to tunnel NBD requests over web
sockets (in the former case nbdkit would not be needed, in the latter
case nbdkit could act as a bridge).

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org



Re: [Qemu-block] v2v: -o rhv-upload: Long time spent zeroing the disk

2018-04-10 Thread Richard W.M. Jones
On Tue, Apr 10, 2018 at 09:52:40AM -0500, Eric Blake wrote:
> Didn't Rich already try to do that?
> 
> +def emulate_zero(h, count, offset):
> +# qemu-img convert starts by trying to zero/trim the whole device.
> +# Since we've just created a new disk it's safe to ignore these
> +# requests as long as they are smaller than the highest write seen.
> +# After that we must emulate them with writes.
> +if offset+count < h['highestwrite']:
> 
> Or is the problem that emulate_zero() is only being called if:
> 
> +# Unlike the trim and flush calls, there is no 'can_zero' method
> +# so nbdkit could call this even if the server doesn't support
> +# zeroing.  If this is the case we must emulate.
> +if not h['can_zero']:
> +emulate_zero(h, count, offset)
> +return
> 
> rather than doing the 'highestwrite' check unconditionally even when
> oVirt supports zero requests?

Exactly.

I'm not sure it's always safe to call emulate_zero, as I'm not sure
that we always know that the remote oVirt device starts off empty.
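
For illustration, a hedged sketch of the unconditional variant being
discussed (not the actual patch; zero_on_server is a hypothetical helper,
and the shortcut is only safe when the freshly created destination is
known to read back as zeroes):

def zero(h, count, offset, may_trim):
    # Sketch: check 'highestwrite' before looking at 'can_zero', so zero
    # requests entirely beyond anything we have written are dropped even
    # when the server supports real zeroing.
    if offset >= h['highestwrite']:
        return
    if h['can_zero']:
        zero_on_server(h, count, offset, may_trim)  # hypothetical server call
    else:
        emulate_zero(h, count, offset)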

Well in one case it's definitely not true: when we created the disk
with disk_format = types.DiskFormat.COW (qcow2).  However I have
currently disabled conversion to qcow2 in this plugin for other
reasons.  (And I think that imageio shouldn't in any case be exposing
the qcow2 format like this -- the APIs are operating at the wrong
level.  The format should be hidden by the imageio APIs and it should
expose only read/write operations on a logically raw device.)

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html



Re: [Qemu-block] v2v: -o rhv-upload: Long time spent zeroing the disk

2018-04-10 Thread Nir Soffer
On Tue, Apr 10, 2018 at 5:50 PM Richard W.M. Jones 
wrote:

> On Tue, Apr 10, 2018 at 02:07:33PM +, Nir Soffer wrote:
> > This makes sense if the device is backed by a block device on the oVirt
> > side, and the NBD server supports efficient zeroing. But in this case
> > the device is backed by an empty sparse file on NFS, and oVirt does not
> > yet support efficient zeroing; we just write zeros manually.
> >
> > I think this should be handled on the virt-v2v plugin side. When zeroing
> > a raw file image, you can ignore zero requests after the highest write
> > offset, since the plugin created a new image, and we know that the image
> > is empty.
> >
> > When the destination is a block device we cannot avoid zeroing, since a
> > block device may contain junk data (we usually get dirty empty images
> > from our local xtremio server).
>
> (Off topic for qemu-block but ...)  We don't have enough information
> at our end to know about any of this.
>

Can't we use this logic in the oVirt plugin?

file based storage -> skip initial zeroing
block based storage -> use initial zeroing

Do you think that publishing disk capabilities in the sdk will solve this?


> > > The problem is that the NBD block driver has max_pwrite_zeroes = 32 MB,
> > > so it's not that efficient after all. I'm not sure if there is a real
> > > reason for this, but Eric should know.
> > >
> >
> > We support zero with unlimited size without sending any payload to oVirt,
> > so there is no reason to limit zero requests by max_pwrite_zeroes. This
> > limit may make sense when zero is emulated using pwrite.
>
> Yes, this seems wrong, but I'd want Eric to comment.
>
> > > > However, since you suggest that we could use "trim" request for these
> > > > requests, it means that these requests are advisory (since trim is),
> and
> > > > we can just ignore them if the server does not support trim.
> > >
> > > What qemu-img sends shouldn't be a NBD_CMD_TRIM request (which is
> indeed
> > > advisory), but a NBD_CMD_WRITE_ZEROES request. qemu-img relies on the
> > > image actually being zeroed after this.
> > >
> >
> > So it seems that may_trim=1 is wrong, since trim cannot replace zero.
>
> Note that the current plugin ignores may_trim.  It is not used at all,
> so it's not relevant to this problem.
>
> However this flag actually corresponds to the inverse of
> NBD_CMD_FLAG_NO_HOLE which is defined by the NBD spec as:
>
> bit 1, NBD_CMD_FLAG_NO_HOLE; valid during
> NBD_CMD_WRITE_ZEROES. SHOULD be set to 1 if the client wants to
> ensure that the server does not create a hole. The client MAY send
> NBD_CMD_FLAG_NO_HOLE even if NBD_FLAG_SEND_TRIM was not set in the
> transmission flags field. The server MUST support the use of this
> flag if it advertises NBD_FLAG_SEND_WRITE_ZEROES. *
>
> qemu-img convert uses NBD_CMD_WRITE_ZEROES and does NOT set this flag
> (hence in the plugin we see may_trim=1), and I believe that qemu-img
> is correct because it doesn't want to force preallocation.
>

So once oVirt supports efficient zeroing, this flag may be translated to
(for file-based storage):

may_trim=1 -> fallocate(FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE)
may_trim=0 -> fallocate(FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE)

We planned to choose this by default on the oVirt side, based on disk type.
For a preallocated disk we never want to use FALLOC_FL_PUNCH_HOLE, and for a
sparse disk we always want to use FALLOC_FL_PUNCH_HOLE unless it is not
supported.
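
For file-based storage, a hedged sketch of that mapping (flag values are
from <linux/falloc.h>; the ctypes approach and helper name are only an
illustration, not the actual imageio code):

import ctypes
import ctypes.util
import os

# Flag values from <linux/falloc.h>
FALLOC_FL_KEEP_SIZE = 0x01
FALLOC_FL_PUNCH_HOLE = 0x02
FALLOC_FL_ZERO_RANGE = 0x10

_libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

def zero_range(fd, offset, length, may_trim):
    # may_trim=1 -> punch a hole (deallocates, still reads back as zeroes)
    # may_trim=0 -> zero the range without deallocating
    mode = FALLOC_FL_KEEP_SIZE
    mode |= FALLOC_FL_PUNCH_HOLE if may_trim else FALLOC_FL_ZERO_RANGE
    ret = _libc.fallocate(fd, mode, ctypes.c_int64(offset),
                          ctypes.c_int64(length))
    if ret != 0:
        # a real implementation would fall back to writing zeros when
        # fallocate() fails with EOPNOTSUPP
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))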

Seems that we need to add a "trim" or "punch_hole" flag to the PATCH/zero
request, so you can hint to oVirt how you want to zero. oVirt will choose
what to do based on storage type (file/block), user request (trim/notrim),
and disk type (thin/preallocated).

I think we can start to use this flag when we publish the "trim" feature.

Nir


Re: [Qemu-block] v2v: -o rhv-upload: Long time spent zeroing the disk

2018-04-10 Thread Eric Blake
On 04/10/2018 09:07 AM, Nir Soffer wrote:
> On Tue, Apr 10, 2018 at 4:48 PM Kevin Wolf  wrote:
> 
>> Am 10.04.2018 um 15:03 hat Nir Soffer geschrieben:
>>> On Tue, Apr 10, 2018 at 1:44 PM Richard W.M. Jones 
>>> wrote:
>>>
 We now have true zeroing support in oVirt imageio, thanks for that.

 However a problem is that ‘qemu-img convert’ issues zero requests for
 the whole disk before starting the transfer.  It does this using 32 MB
 requests which take approx. 1 second each to execute on the oVirt side.
>>>
>>>
 Two problems therefore:

 (1) Zeroing the disk can take a long time (eg. 40 GB is approx.
 20 minutes).  Furthermore there is no progress indication while
>> this
 is happening.

This is going to be true whether or not you write zeroes in 32M chunks
or in 2G chunks - it takes a long time to write actual zeroes to a block
device if you are unsure of whether the device already contains zeroes.
There is more overhead for sending 64 requests of 32M each than 1
request for 2G; there's a question of whether that's in the noise
(slightly more data sent over the wire) or impactful (because you have
to wait for more round trips, where the time spent waiting for traffic
is on par with the time spent writing zeroes for a single request).

The only way that a write zeroes request is not going to be slower than
a normal write is if the block device itself supports an efficient way
to guarantee that the sectors of the disk will read as zero (for
example, using things like WRITE_SAME on iscsi devices).


>>>
 Nothing bad happens: because it is making frequent requests there
 is no timeout.

 (2) I suspect that because we don't have trim support that this is
 actually causing the disk to get fully allocated on the target.

 The NBD requests are sent with may_trim=1 so we could turn these
 into trim requests, but obviously cannot do that while there is no
 trim support.

In fact, if a trim request guarantees that you can read back zeroes
regardless of what was previously on the block device, then that is
precisely what you SHOULD be doing to make write zeroes more efficient
(but only when may_trim=1).


>>>
>>> It sounds like nbdkit is emulating trim with zero instead of noop.

No, qemu-img is NOT requesting trim, it is requesting write zeroes.  You
can implement write zeroes with a trim if the trim will read back as
zeroes.  But while trim is advisory, write zeroes has mandatory
semantics on what you read back (where may_trim=1 is a determining
factor on whether the write MUST allocate, or MAY trim. Ignoring
may_trim and always allocating is semantically correct but may be
slower, while trimming is correct only when may_trim=1).

>>>
>>> I'm not sure what qemu-img is trying to do; I hope the NBD maintainer on
>>> the qemu side can explain this.
>>
>> qemu-img tries to efficiently zero out the whole device at once so that
>> it doesn't have to use individual small write requests for unallocated
>> parts of the image later on.

At one point, there was a proposal to have the NBD protocol add
something where the server could advertise to the client if it is known
at initial connection time that the export is starting life with ALL
sectors zeroed.  (Easy to prove for a just-created sparse file, a bit
harder to prove for a block device although at least some iscsi devices
do have queries to learn if the entire device is unallocated).

This has not yet been implemented in the NBD protocol, but may be worth
doing; it is something that is slightly redundant with the
NBD_CMD_BLOCK_STATUS that qemu 2.12 is introducing (in that the client
can perform that sort of query itself rather than the server advertising
it at initial connection), but may be easy enough to implement even
where NBD_CMD_BLOCK_STATUS is difficult that it would still allow
qemu-img to operate more efficiently in some situations.  But qemu-img
DOES know how to skip zeroing a block device if it knows up front that
the device already reads as all zeroes, so the missing piece of
information is getting NBD to tell that to qemu-img.

Meanwhile, NBD_CMD_BLOCK_STATUS is still quite a ways from being
supported in nbdkit, so that's not anything that rhv-upload can exploit
any time soon.

>>
> 
> This makes sense if the device is backed by a block device on the oVirt
> side, and the NBD server supports efficient zeroing. But in this case the
> device is backed by an empty sparse file on NFS, and oVirt does not yet
> support efficient zeroing; we just write zeros manually.
> 
> I think this should be handled on the virt-v2v plugin side. When zeroing a
> raw file image, you can ignore zero requests after the highest write
> offset, since the plugin created a new image, and we know that the image
> is empty.

Didn't Rich already try to do that?

+def emulate_zero(h, count, offset):
+# qemu-img convert starts by trying to zero/trim the 

Re: [Qemu-block] v2v: -o rhv-upload: Long time spent zeroing the disk

2018-04-10 Thread Richard W.M. Jones
On Tue, Apr 10, 2018 at 02:07:33PM +, Nir Soffer wrote:
> This makes sense if the device is backed by a block device on the oVirt
> side, and the NBD server supports efficient zeroing. But in this case the
> device is backed by an empty sparse file on NFS, and oVirt does not yet
> support efficient zeroing; we just write zeros manually.
> 
> I think this should be handled on the virt-v2v plugin side. When zeroing a
> raw file image, you can ignore zero requests after the highest write
> offset, since the plugin created a new image, and we know that the image
> is empty.
> 
> When the destination is a block device we cannot avoid zeroing, since a
> block device may contain junk data (we usually get dirty empty images from
> our local xtremio server).

(Off topic for qemu-block but ...)  We don't have enough information
at our end to know about any of this.

> > The problem is that the NBD block driver has max_pwrite_zeroes = 32 MB,
> > so it's not that efficient after all. I'm not sure if there is a real
> > reason for this, but Eric should know.
> >
> 
> We support zero with unlimited size without sending any payload to oVirt,
> so there is no reason to limit zero requests by max_pwrite_zeroes. This
> limit may make sense when zero is emulated using pwrite.

Yes, this seems wrong, but I'd want Eric to comment.

> > > However, since you suggest that we could use "trim" request for these
> > > requests, it means that these requests are advisory (since trim is), and
> > > we can just ignore them if the server does not support trim.
> >
> > What qemu-img sends shouldn't be a NBD_CMD_TRIM request (which is indeed
> > advisory), but a NBD_CMD_WRITE_ZEROES request. qemu-img relies on the
> > image actually being zeroed after this.
> >
> 
> So it seems that may_trim=1 is wrong, since trim cannot replace zero.

Note that the current plugin ignores may_trim.  It is not used at all,
so it's not relevant to this problem.

However this flag actually corresponds to the inverse of
NBD_CMD_FLAG_NO_HOLE which is defined by the NBD spec as:

bit 1, NBD_CMD_FLAG_NO_HOLE; valid during
NBD_CMD_WRITE_ZEROES. SHOULD be set to 1 if the client wants to
ensure that the server does not create a hole. The client MAY send
NBD_CMD_FLAG_NO_HOLE even if NBD_FLAG_SEND_TRIM was not set in the
transmission flags field. The server MUST support the use of this
flag if it advertises NBD_FLAG_SEND_WRITE_ZEROES. *

qemu-img convert uses NBD_CMD_WRITE_ZEROES and does NOT set this flag
(hence in the plugin we see may_trim=1), and I believe that qemu-img
is correct because it doesn't want to force preallocation.

Rich.

* https://github.com/NetworkBlockDevice/nbd/blob/master/doc/proto.md

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top



Re: [Qemu-block] v2v: -o rhv-upload: Long time spent zeroing the disk

2018-04-10 Thread Richard W.M. Jones
On Tue, Apr 10, 2018 at 03:48:11PM +0200, Kevin Wolf wrote:
> Am 10.04.2018 um 15:03 hat Nir Soffer geschrieben:
> > On Tue, Apr 10, 2018 at 1:44 PM Richard W.M. Jones 
> > wrote:
> > 
> > > We now have true zeroing support in oVirt imageio, thanks for that.
> > >
> > > However a problem is that ‘qemu-img convert’ issues zero requests for
> > > the whole disk before starting the transfer.  It does this using 32 MB
> > > requests which take approx. 1 second each to execute on the oVirt side.
> > 
> > 
> > > Two problems therefore:
> > >
> > > (1) Zeroing the disk can take a long time (eg. 40 GB is approx.
> > > 20 minutes).  Furthermore there is no progress indication while this
> > > is happening.
> > >
> > 
> > > Nothing bad happens: because it is making frequent requests there
> > > is no timeout.
> > >
> > > (2) I suspect that because we don't have trim support that this is
> > > actually causing the disk to get fully allocated on the target.
> > >
> > > The NBD requests are sent with may_trim=1 so we could turn these
> > > into trim requests, but obviously cannot do that while there is no
> > > trim support.
> > >
> > 
> > It sounds like nbdkit is emulating trim with zero instead of noop.
> > 
> > I'm not sure what qemu-img is trying to do; I hope the NBD maintainer on
> > the qemu side can explain this.
> 
> qemu-img tries to efficiently zero out the whole device at once so that
> it doesn't have to use individual small write requests for unallocated
> parts of the image later on.
> 
> The problem is that the NBD block driver has max_pwrite_zeroes = 32 MB,
> so it's not that efficient after all. I'm not sure if there is a real
> reason for this, but Eric should know.
> 
> > However, since you suggest that we could use "trim" request for these
> > requests, it means that these requests are advisory (since trim is), and
> > we can just ignore them if the server does not support trim.
> 
> What qemu-img sends shouldn't be a NBD_CMD_TRIM request (which is indeed
> advisory), but a NBD_CMD_WRITE_ZEROES request. qemu-img relies on the
> image actually being zeroed after this.

Yup it's actually sending NBD_CMD_WRITE_ZEROES with the flag
NBD_CMD_FLAG_NO_HOLE clear (not set).

I think Eric needs to comment here ..

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org



Re: [Qemu-block] v2v: -o rhv-upload: Long time spent zeroing the disk

2018-04-10 Thread Nir Soffer
On Tue, Apr 10, 2018 at 4:48 PM Kevin Wolf  wrote:

> Am 10.04.2018 um 15:03 hat Nir Soffer geschrieben:
> > On Tue, Apr 10, 2018 at 1:44 PM Richard W.M. Jones 
> > wrote:
> >
> > > We now have true zeroing support in oVirt imageio, thanks for that.
> > >
> > > However a problem is that ‘qemu-img convert’ issues zero requests for
> > > the whole disk before starting the transfer.  It does this using 32 MB
> > > requests which take approx. 1 second each to execute on the oVirt side.
> >
> >
> > > Two problems therefore:
> > >
> > > (1) Zeroing the disk can take a long time (eg. 40 GB is approx.
> > > 20 minutes).  Furthermore there is no progress indication while
> this
> > > is happening.
> > >
> >
> > > Nothing bad happens: because it is making frequent requests there
> > > is no timeout.
> > >
> > > (2) I suspect that because we don't have trim support that this is
> > > actually causing the disk to get fully allocated on the target.
> > >
> > > The NBD requests are sent with may_trim=1 so we could turn these
> > > into trim requests, but obviously cannot do that while there is no
> > > trim support.
> > >
> >
> > It sounds like nbdkit is emulating trim with zero instead of noop.
> >
> > I'm not sure what qemu-img is trying to do; I hope the NBD maintainer on
> > the qemu side can explain this.
>
> qemu-img tries to efficiently zero out the whole device at once so that
> it doesn't have to use individual small write requests for unallocated
> parts of the image later on.
>

This makes sense if the device is backed by a block device on the oVirt
side, and the NBD server supports efficient zeroing. But in this case the
device is backed by an empty sparse file on NFS, and oVirt does not yet
support efficient zeroing; we just write zeros manually.

I think this should be handled on the virt-v2v plugin side. When zeroing a
raw file image, you can ignore zero requests after the highest write offset,
since the plugin created a new image, and we know that the image is empty.

When the destination is a block device we cannot avoid zeroing, since a
block device may contain junk data (we usually get dirty empty images from
our local xtremio server).


> The problem is that the NBD block driver has max_pwrite_zeroes = 32 MB,
> so it's not that efficient after all. I'm not sure if there is a real
> reason for this, but Eric should know.
>

We support zero with unlimited size without sending any payload to oVirt,
so there is no reason to limit zero requests by max_pwrite_zeroes. This
limit may make sense when zero is emulated using pwrite.


>
> > However, since you suggest that we could use "trim" request for these
> > requests, it means that these requests are advisory (since trim is), and
> > we can just ignore them if the server does not support trim.
>
> What qemu-img sends shouldn't be a NBD_CMD_TRIM request (which is indeed
> advisory), but a NBD_CMD_WRITE_ZEROES request. qemu-img relies on the
> image actually being zeroed after this.
>

So it seems that may_trim=1 is wrong, since trim cannot replace zero.

Nir


Re: [Qemu-block] v2v: -o rhv-upload: Long time spent zeroing the disk

2018-04-10 Thread Kevin Wolf
Am 10.04.2018 um 15:03 hat Nir Soffer geschrieben:
> On Tue, Apr 10, 2018 at 1:44 PM Richard W.M. Jones 
> wrote:
> 
> > We now have true zeroing support in oVirt imageio, thanks for that.
> >
> > However a problem is that ‘qemu-img convert’ issues zero requests for
> > the whole disk before starting the transfer.  It does this using 32 MB
> > requests which take approx. 1 second each to execute on the oVirt side.
> 
> 
> > Two problems therefore:
> >
> > (1) Zeroing the disk can take a long time (eg. 40 GB is approx.
> > 20 minutes).  Furthermore there is no progress indication while this
> > is happening.
> >
> 
> > Nothing bad happens: because it is making frequent requests there
> > is no timeout.
> >
> > (2) I suspect that because we don't have trim support that this is
> > actually causing the disk to get fully allocated on the target.
> >
> > The NBD requests are sent with may_trim=1 so we could turn these
> > into trim requests, but obviously cannot do that while there is no
> > trim support.
> >
> 
> It sounds like nbdkit is emulating trim with zero instead of noop.
> 
> I'm not sure what qemu-img is trying to do; I hope the NBD maintainer on
> the qemu side can explain this.

qemu-img tries to efficiently zero out the whole device at once so that
it doesn't have to use individual small write requests for unallocated
parts of the image later on.
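
For reference, this up-front zeroing is easy to observe against any NBD
server, e.g. qemu-nbd (a hedged sketch; file names, the size and the
socket path are hypothetical):

  truncate -s 10G dst.img      # pre-create a sparse destination of matching size
  qemu-nbd -t -f raw -k /tmp/nbd.sock dst.img &
  qemu-img convert -n -p -f qcow2 -O raw src.qcow2 'nbd:unix:/tmp/nbd.sock'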

The problem is that the NBD block driver has max_pwrite_zeroes = 32 MB,
so it's not that efficient after all. I'm not sure if there is a real
reason for this, but Eric should know.

> However, since you suggest that we could use "trim" request for these
> requests, it means that these requests are advisory (since trim is), and
> we can just ignore them if the server does not support trim.

What qemu-img sends shouldn't be a NBD_CMD_TRIM request (which is indeed
advisory), but a NBD_CMD_WRITE_ZEROES request. qemu-img relies on the
image actually being zeroed after this.

Kevin



Re: [Qemu-block] v2v: -o rhv-upload: Long time spent zeroing the disk

2018-04-10 Thread Nir Soffer
On Tue, Apr 10, 2018 at 1:44 PM Richard W.M. Jones 
wrote:

> We now have true zeroing support in oVirt imageio, thanks for that.
>
> However a problem is that ‘qemu-img convert’ issues zero requests for
> the whole disk before starting the transfer.  It does this using 32 MB
> requests which take approx. 1 second each to execute on the oVirt side.


> Two problems therefore:
>
> (1) Zeroing the disk can take a long time (eg. 40 GB is approx.
> 20 minutes).  Furthermore there is no progress indication while this
> is happening.
>

> Nothing bad happens: because it is making frequent requests there
> is no timeout.
>
> (2) I suspect that because we don't have trim support that this is
> actually causing the disk to get fully allocated on the target.
>
> The NBD requests are sent with may_trim=1 so we could turn these
> into trim requests, but obviously cannot do that while there is no
> trim support.
>

It sounds like nbdkit is emulating trim with zero instead of noop.

I'm not sure what qemu-img is trying to do; I hope the NBD maintainer on
the qemu side can explain this.

However, since you suggest that we could use "trim" request for these
requests, it means that these requests are advisory (since trim is), and
we can just ignore them if the server does not support trim.

This will also solve the timeout issue you reported in private mail.

Adding Eric and qemu-block.

Nir