Re: [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-04-11 Thread Markus Pargmann
On Tuesday 05 April 2016 22:50:51 Wouter Verhelst wrote:
> On Tue, Apr 05, 2016 at 08:14:01AM -0600, Eric Blake wrote:
> > On 04/05/2016 03:24 AM, Markus Pargmann wrote:
> > 
> > >> +requested.
> > >> +
> > >> +The client SHOULD NOT read from an area that has both
> > >> +`NBD_STATE_HOLE` set and `NBD_STATE_ZERO` clear.
> > > 
> > > Why not? If we don't execute CMD_BLOCK_STATUS we wouldn't know about the
> > > situation and would simply read directly. To fulfill this statement we
> > > would need to receive the block status before every read operation.
> > 
> > Because we already state that for NBD_CMD_TRIM, the client SHOULD NOT
> > read an area where a trim was requested without an intervening write,
> 
> Actually, we state that a client SHOULD NOT *make any assumption* about
> the state (emphasis added). There's a difference.
> 
> If the client wants to write, there is no problem (but he'll probably
> get garbage).
> 
> [...]
> > > Also something that is kind of missing from the document so far is
> > > concurrency with other NBD clients. Certainly most users do not use NBD
> > > for concurrent access to the backend storage. But for example the
> > > sentence above ignores the fact that another client may work on the
> > > backend and that the state may change after some time so that it may
> > > still be necessary to read from an area with NBD_STATE_HOLE and
> > > NBD_STATE_ZERO set.
> > 
> > That's missing from NBD in general, and I don't think this is the patch
> > to add it.  We already have concurrency with self issues, because NBD
> > server can handle requests out of order (within the bounds of FLUSH and
> > FUA modifiers).
> 
> I don't think NBD *needs* to add much in the way of concurrency, other
> than write barriers etc; but having an explicit message "some other
> process just modified offset X length Y" seems out of scope for NBD.

Yes, it certainly is out of scope to add explicit messages for concurrency. I
was only thinking of some general information for the reader: a small paragraph
in the protocol documentation stating that concurrent access can occur at any
time, and that no assumptions can be made about the state of the server and its
backend storage at any point.

> 
> > > Also it is uncertain if these status bits may change over time through
> > > reorganization of backend storage, for example holes may be removed in
> > > the backend and so on. Is it safe to cache this stuff?
> > 
> > If the client is the only thing modifying the drive, maybe we want to
> > make that additional constraint on the server.  But how best to word it,
> > or is it too tight of a specification?
> 
> I think it's perfectly fine for the protocol to assume in general that
> the client is the only writer, and that no other writers are
> concurrently accessing the same device. Anything else would make the
> protocol much more complex; and one of the features of NBD is, I think,
> the fact that the protocol is not *too* complex.

Yes, this definitely is a feature. I just think that concurrent access has
always been possible and we should not remove that possibility by accident. But
as far as I can see that's not an issue here.

Best Regards,

Markus

-- 
Pengutronix e.K.   | |
Industrial Linux Solutions | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0|
Amtsgericht Hildesheim, HRA 2686   | Fax:   +49-5121-206917- |


signature.asc
Description: This is a digitally signed message part.
Nbd-general mailing list
Nbd-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nbd-general


Re: [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-04-07 Thread Paolo Bonzini


On 07/04/2016 17:35, Pavel Borzenkov wrote:
> > On 05/04/2016 06:05, Kevin Wolf wrote:
> > > The options I can think of is adding a request field "max number of
> > > descriptors" or a flag "only single descriptor" (with the assumption
> > > that clients always want one or unlimited), but maybe you have a better
> > > idea.
> > 
> > I think a limit is better.  Even if the client is ultimately going to
> > process the whole file, it may take a very long time and space to
> > retrieve all the descriptors in one go.  Rather than query e.g. 16GB at
> > a time, I think it's simpler to put a limit of 1024 descriptors or so.
> 
> Agree. I'm not sure that the value of the limit should be hard-coded
> into the protocol, though. Why not just allow a server to send less
> data than requested (similar to what SCSI "GET LBA STATUS" allows) and
> allow the server to choose the limit suitable for it (without directly
> exposing it to the clients)?

Yes, definitely.  The number was just an example.  Even 1 would be fine.

Paolo
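To make the behaviour discussed here concrete, below is a sketch (in Python, purely illustrative and not part of any spec) of a client loop that tolerates a server describing less than the requested range per reply. `query_block_status` and the `server_chunks` backend are hypothetical stand-ins for the real NBD round trip; the cap of two descriptors per reply is an arbitrary example of a server-chosen limit.

```python
def query_block_status(offset, length, server_chunks, max_descriptors=2):
    """Mock server side: return at most `max_descriptors` (length, flags)
    pairs, possibly covering less than the requested range."""
    out = []
    covered = 0
    for chunk_len, flags in server_chunks(offset, length):
        if len(out) == max_descriptors or covered >= length:
            break
        out.append((min(chunk_len, length - covered), flags))
        covered += out[-1][0]
    return out

def full_block_status(offset, length, server_chunks):
    """Client side: keep re-issuing the query until the whole range
    has been described, advancing past whatever the server returned."""
    result = []
    while length > 0:
        descs = query_block_status(offset, length, server_chunks)
        if not descs:
            raise IOError("server returned no descriptors")
        for dlen, flags in descs:
            result.append((dlen, flags))
            offset += dlen
            length -= dlen
    return result
```

The point of the loop is that the client never needs to know the server's limit; it only relies on forward progress.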



Re: [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-04-07 Thread Pavel Borzenkov
On Tue, Apr 05, 2016 at 03:43:08PM +0200, Paolo Bonzini wrote:
> 
> 
> On 05/04/2016 06:05, Kevin Wolf wrote:
> > The options I can think of is adding a request field "max number of
> > descriptors" or a flag "only single descriptor" (with the assumption
> > that clients always want one or unlimited), but maybe you have a better
> > idea.
> 
> I think a limit is better.  Even if the client is ultimately going to
> process the whole file, it may take a very long time and space to
> retrieve all the descriptors in one go.  Rather than query e.g. 16GB at
> a time, I think it's simpler to put a limit of 1024 descriptors or so.

Agree. I'm not sure that the value of the limit should be hard-coded
into the protocol, though. Why not just allow a server to send less
data than requested (similar to what SCSI "GET LBA STATUS" allows) and
allow the server to choose the limit suitable for it (without directly
exposing it to the clients)?

> Paolo



Re: [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-04-06 Thread Eric Blake
On 04/05/2016 11:57 PM, Denis V. Lunev wrote:

>> Looks like there will still be some more conversation and at least a v3
>> needed, but I'll wait a couple days for that to happen so that more
>> reviewers can chime in, without being too tired during the review
>> process.
>>
> that looks correct to me. I agree that we should set only one flag
> and reject the request with two of them.
> Actually this is like a "bitmap number", but we are limited to
> only 32 numbers.
> 
> We could specify that the server MUST reply with "all 1" for unknown
> flag. This would provide nice forward compatibility.

My v2 approach was to define the status so that "all 0" was the safe
default (hence naming the flag "NBD_STATUS_CLEAN" and setting it to 1 only
when a region is known to no longer be dirty, rather than "NBD_STATUS_DIRTY",
where a default of 1 would have been the safer value).
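The forward-compatibility argument can be sketched as follows (illustrative only; `NBD_STATUS_CLEAN` is the flag name from this thread, and the backup heuristic is a hypothetical consumer, not spec text). A server that never sets the bit, because it predates it or doesn't track dirtiness, is automatically treated conservatively:

```python
NBD_STATUS_CLEAN = 1 << 0  # set only when the region is provably clean

def must_back_up(status):
    # Back up unless the server positively marked the region clean.
    # A status of 0 (the "all 0" safe default) always means "back it up".
    return not (status & NBD_STATUS_CLEAN)
```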

-- 
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-04-05 Thread Wouter Verhelst
On Mon, Apr 04, 2016 at 05:32:34PM -0600, Eric Blake wrote:
> On 04/04/2016 05:08 PM, Wouter Verhelst wrote:
> > On Mon, Apr 04, 2016 at 10:54:02PM +0300, Denis V. Lunev wrote:
> >> Speaking of dirtiness, we would soon come to the fact that
> >> we can have several dirtiness states regarding different
> >> lines of incremental backups. This complexity is hidden
> >> inside QEMU and it would be very difficult to publish and
> >> reuse it.
> > 
> > How about this then.
> > 
> > A reply to GET_BLOCK_STATUS containing chunks of this:
> > 
> > 32-bit length
> > 32-bit "snapshot status"
> > if bit 0 in the latter field is set, that means the block is allocated
> >   on the original device
> > if bit 1 is set, that means the block is allocated on the first-level
> >   snapshot
> > if bit 2 is set, that means the block is allocated on the second-level
> >   snapshot
> 
> The idea of allocation is orthogonal from the idea of reads as zeroes.
> That is, a client may usefully guarantee that something reads as zeroes,
> whether or not it is allocated (but knowing whether it is a hole or
> allocated will determine whether future writes to that area will cause
> file system fragmentation or be at risk of ENOSPC on thin-provisioning).
>  If we want to expose the notion of depth (and I'm not sure about that
> yet), we may want to reserve bit 0 for 'reads as zero' and bits 1-30 as
> 'allocated at depth "bit-1"' (and bit 31 as 'allocated at depth 30 or
> greater').
> 
> I don't know if the idea of depth of allocation is useful enough to
> expose in this manner; qemu could certainly advertise depth if the
> protocol calls it out, but I'm still not sure whether knowing depth
> helps any algorithms.
> 
> > 
> > If all flags are cleared, that means the block is not allocated (i.e.,
> > is a hole) and MUST read as zeroes.
> 
> That's too strong.  NBD_CMD_TRIM says that we can create holes whose
> data does not necessarily read as zeroes (and SCSI definitely has
> semantics like this - not all devices guarantee zero reads when you
> UNMAP; and WRITE_SAME has an UNMAP flag to control whether you are okay
> with the faster unmapping operation at the expense of bad reads, or
> slower explicit writes).  Hence my complaint that we have to treat
> 'reads as zero' as an orthogonal bit to 'allocated at depth X'.

Right.

It would be possible to instead mark bit 0 as NBD_ALLOC_LEVEL_DATA (i.e.,
"not just zeroes" -- since I believe a flag value of "one" to indicate
"zeroes" is bound to cause confusion), bit 1 as NBD_ALLOC_LEVEL_CURRENT
(i.e., "allocated in the current snapshot level"), and bits 2 through 31
as NBD_ALLOC_LEVEL_2 through NBD_ALLOC_LEVEL_31, with
implementation-defined meanings, as above.

> > If a flag is set at a particular level X, that means the device is dirty
> > at the Xth-level snapshot.
> > 
> > If at least one flag is set for a region, that means the data may read
> > as "not zero".
> > 
> > The protocol does not define what it means to have multiple levels of
> > snapshots, other than:
> > 
> > - Any write command (WRITE or WRITE_ZEROES) MUST NOT clear or set the
> >   Xth level flag if the Yth level flag is not also cleared at the same
> >   time, for any Y > X
> > - Any write (as above) MAY set or clear multiple levels of flags at the
> >   same time, as long as the above holds

Having thought about this on my way to work, I'm now no longer convinced
this is a good idea. I could imagine several scenarios where violating
these constraints might be useful, and none where not violating them
would buy us anything.

Instead, I suggest the following semantics:

- NBD_CMD_WRITE to a region which has NBD_ALLOC_LEVEL_DATA cleared MUST
  set that bit for that region
- NBD_CMD_WRITE_ZEROES or NBD_CMD_TRIM to a region which has
  NBD_ALLOC_LEVEL_DATA set MAY clear that bit for that region
- NBD_CMD_TRIM to a region which has one of the NBD_ALLOC_LEVEL_X flags
  set MAY just clear those flags (and no other). In that case, the
  contents of that region (when read) SHOULD be assumed to have changed,
  but is not defined, and is unlikely to read as zeroes if other
  NBD_ALLOC_LEVEL_* flags are still set.
- NBD_CMD_WRITE to a region which has NBD_ALLOC_LEVEL_CURRENT set SHOULD
  NOT fail with ENOSPC, although it MAY still fail with other errors.
- NBD_CMD_READ from a region which has NBD_ALLOC_LEVEL_DATA cleared
  MUST read as zeroes

This may all be a bit overengineered, but it has the benefit of
being fairly generic. It also means that if the flags field is
all-zeroes, the region is "uninteresting".
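The proposed rules can be modelled as state transitions on a per-region flags word. The sketch below (illustrative only; the `NBD_ALLOC_LEVEL_*` names follow this mail, and the functions are a reading of the proposal rather than spec text) shows the MUST/MAY behaviour for the data bit:

```python
NBD_ALLOC_LEVEL_DATA = 1 << 0     # "not just zeroes"
NBD_ALLOC_LEVEL_CURRENT = 1 << 1  # allocated in the current snapshot level

def after_write(flags):
    # NBD_CMD_WRITE MUST set NBD_ALLOC_LEVEL_DATA if it was clear.
    return flags | NBD_ALLOC_LEVEL_DATA

def after_write_zeroes(flags):
    # NBD_CMD_WRITE_ZEROES (or TRIM) MAY clear NBD_ALLOC_LEVEL_DATA;
    # this models a server that takes that option.
    return flags & ~NBD_ALLOC_LEVEL_DATA

def reads_as_zeroes(flags):
    # A region with NBD_ALLOC_LEVEL_DATA cleared MUST read as zeroes.
    return not (flags & NBD_ALLOC_LEVEL_DATA)
```

Note how an all-zeroes flags word is "uninteresting" by construction: it reads as zeroes and carries no allocation at any level.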

Thoughts?

[...]

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
   people in the world who think they really understand all of its rules,
   and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12


Re: [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-04-05 Thread Wouter Verhelst
On Tue, Apr 05, 2016 at 08:14:01AM -0600, Eric Blake wrote:
> On 04/05/2016 03:24 AM, Markus Pargmann wrote:
> 
> >> +requested.
> >> +
> >> +The client SHOULD NOT read from an area that has both
> >> +`NBD_STATE_HOLE` set and `NBD_STATE_ZERO` clear.
> > 
> > Why not? If we don't execute CMD_BLOCK_STATUS we wouldn't know about the
> > situation and would simply read directly. To fulfill this statement we
> > would need to receive the block status before every read operation.
> 
> Because we already state that for NBD_CMD_TRIM, the client SHOULD NOT
> read an area where a trim was requested without an intervening write,

Actually, we state that a client SHOULD NOT *make any assumption* about
the state (emphasis added). There's a difference.

If the client wants to write, there is no problem (but he'll probably
get garbage).

[...]
> > Also something that is kind of missing from the document so far is
> > concurrency with other NBD clients. Certainly most users do not use NBD
> > for concurrent access to the backend storage. But for example the
> > sentence above ignores the fact that another client may work on the
> > backend and that the state may change after some time so that it may
> > still be necessary to read from an area with NBD_STATE_HOLE and
> > NBD_STATE_ZERO set.
> 
> That's missing from NBD in general, and I don't think this is the patch
> to add it.  We already have concurrency with self issues, because NBD
> server can handle requests out of order (within the bounds of FLUSH and
> FUA modifiers).

I don't think NBD *needs* to add much in the way of concurrency, other
than write barriers etc; but having an explicit message "some other
process just modified offset X length Y" seems out of scope for NBD.

> > Also it is uncertain if these status bits may change over time through
> > reorganization of backend storage, for example holes may be removed in
> > the backend and so on. Is it safe to cache this stuff?
> 
> If the client is the only thing modifying the drive, maybe we want to
> make that additional constraint on the server.  But how best to word it,
> or is it too tight of a specification?

I think it's perfectly fine for the protocol to assume in general that
the client is the only writer, and that no other writers are
concurrently accessing the same device. Anything else would make the
protocol much more complex; and one of the features of NBD is, I think,
the fact that the protocol is not *too* complex.

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
   people in the world who think they really understand all of its rules,
   and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12



Re: [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-04-05 Thread Alex Bligh

On 5 Apr 2016, at 16:23, Paolo Bonzini  wrote:

>> I'm missing how the ugh-author can write the software without knowing
>> exactly what the bit does. Or are you saying "that's a matter
>> for the qemu spec, not the nbd spec"?
> 
> Yes, that's it.  NBD defines a safe default and a general idea of what
> it should be used for.

OK, so my argument then is that we should not have a special-case
'only Qemu knows what this does' bit if we can avoid it, and should
instead have a general system of 'proprietary bits' (an unfortunate
word, but perhaps 'non-NBD').

Of course this is made harder as the NBD_CMD_ size is not
extensible (today).

-- 
Alex Bligh







Re: [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-04-05 Thread Alex Bligh

On 5 Apr 2016, at 15:15, Paolo Bonzini  wrote:

> 
> On 04/04/2016 22:03, Alex Bligh wrote:
>> So qemu-nbdserver has some idea of 'dirtiness', and you want
>> to publish that, and the consumer is qemu (and only qemu),
>> because only qemu knows what 'dirtiness' means in the
>> sense the server provides it?
> 
> The consumer is not QEMU; the consumer is backup software which is not
> part of QEMU (and could even be proprietary, ugh).

I'm missing how the ugh-author can write the software without knowing
exactly what the bit does. Or are you saying "that's a matter
for the qemu spec, not the nbd spec"?

-- 
Alex Bligh







Re: [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-04-05 Thread Paolo Bonzini


On 04/04/2016 22:03, Alex Bligh wrote:
> So qemu-nbdserver has some idea of 'dirtiness', and you want
> to publish that, and the consumer is qemu (and only qemu),
> because only qemu knows what 'dirtiness' means in the
> sense the server provides it?

The consumer is not QEMU; the consumer is backup software which is not
part of QEMU (and could even be proprietary, ugh).

Paolo



Re: [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-04-05 Thread Eric Blake
On 04/05/2016 03:24 AM, Markus Pargmann wrote:

>> +requested.
>> +
>> +The client SHOULD NOT read from an area that has both
>> +`NBD_STATE_HOLE` set and `NBD_STATE_ZERO` clear.
> 
> Why not? If we don't execute CMD_BLOCK_STATUS we wouldn't know about the
> situation and would simply read directly. To fulfill this statement we
> would need to receive the block status before every read operation.

Because we already state that for NBD_CMD_TRIM, the client SHOULD NOT
read an area where a trim was requested without an intervening write,
precisely because the server MAY (but not MUST) cause the trim to create
bogus reads for that area until another write happens.  I was just
trying to explain that the representation of HOLE and not ZERO
represents the state created by a successful NBD_CMD_TRIM that the
server honors, without being able to guarantee zero reads.

> 
> Also something that is kind of missing from the document so far is
> concurrency with other NBD clients. Certainly most users do not use NBD
> for concurrent access to the backend storage. But for example the
> sentence above ignores the fact that another client may work on the
> backend and that the state may change after some time so that it may
> still be necessary to read from an area with NBD_STATE_HOLE and
> NBD_STATE_ZERO set.

That's missing from NBD in general, and I don't think this is the patch
to add it.  We already have concurrency with self issues, because NBD
server can handle requests out of order (within the bounds of FLUSH and
FUA modifiers).

> 
> Also it is uncertain if these status bits may change over time through
> reorganization of backend storage, for example holes may be removed in
> the backend and so on. Is it safe to cache this stuff?

If the client is the only thing modifying the drive, maybe we want to
make that additional constraint on the server.  But how best to word it,
or is it too tight of a specification?

> 
> Until now something like READ and WRITE were somehow atomic operations
> in the protocol.

Not really.

-- 
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-04-05 Thread Wouter Verhelst
On Mon, Apr 04, 2016 at 05:32:34PM -0600, Eric Blake wrote:
> On 04/04/2016 05:08 PM, Wouter Verhelst wrote:
> > On Mon, Apr 04, 2016 at 10:54:02PM +0300, Denis V. Lunev wrote:
> >> Speaking of dirtiness, we would soon come to the fact that
> >> we can have several dirtiness states regarding different
> >> lines of incremental backups. This complexity is hidden
> >> inside QEMU and it would be very difficult to publish and
> >> reuse it.
> > 
> > How about this then.
> > 
> > A reply to GET_BLOCK_STATUS containing chunks of this:
> > 
> > 32-bit length
> > 32-bit "snapshot status"
> > if bit 0 in the latter field is set, that means the block is allocated
> >   on the original device
> > if bit 1 is set, that means the block is allocated on the first-level
> >   snapshot
> > if bit 2 is set, that means the block is allocated on the second-level
> >   snapshot
> 
> The idea of allocation is orthogonal from the idea of reads as zeroes.
> That is, a client may usefully guarantee that something reads as zeroes,
> whether or not it is allocated (but knowing whether it is a hole or
> allocated will determine whether future writes to that area will cause
> file system fragmentation or be at risk of ENOSPC on thin-provisioning).
>  If we want to expose the notion of depth (and I'm not sure about that
> yet), we may want to reserve bit 0 for 'reads as zero' and bits 1-30 as
> 'allocated at depth "bit-1"'

That works too, I suppose.

> (and bit 31 as 'allocated at depth 30 or greater').

The reason I said "the protocol does not define" was specifically to
avoid this. There is nothing that compels a server to map its
implementation-specific snapshots one-to-one on NBD protocol snapshots.
E.g., a client could ask for a snapshot out of band; the server would
then create a new snapshot on top of its existing snapshots, and map
things as: bit 0 for "reads as zeroes", bit 1 for "allocated before the
snapshot you just asked me for", folding together all allocation bits
for older snapshots, and bit 2 as "this block has been changed in the
context of the snapshot you just asked me for".

The out-of-band protocol could alternatively say "NBD snapshot level 13
is the snapshot I just created for you".

> I don't know if the idea of depth of allocation is useful enough to
> expose in this manner; qemu could certainly advertise depth if the
> protocol calls it out, but I'm still not sure whether knowing depth
> helps any algorithms.

The point is that with the proposed format, depth *can* be exposed but
doesn't *have* to be. The original proposal provided two levels of
allocation (on the original device, and on the snapshot or whatever
tracking method is used). This proposal provides 32 (if counting the "is
zeroes" bit as a separate level).

> > If all flags are cleared, that means the block is not allocated (i.e.,
> > is a hole) and MUST read as zeroes.
> 
> That's too strong.  NBD_CMD_TRIM says that we can create holes whose
> data does not necessarily read as zeroes (and SCSI definitely has
> semantics like this - not all devices guarantee zero reads when you
> UNMAP; and WRITE_SAME has an UNMAP flag to control whether you are okay
> with the faster unmapping operation at the expense of bad reads, or
> slower explicit writes).  Hence my complaint that we have to treat
> 'reads as zero' as an orthogonal bit to 'allocated at depth X'.

Right, so we should special-case the first bit then, as you suggest.

> > If a flag is set at a particular level X, that means the device is dirty
> > at the Xth-level snapshot.
> > 
> > If at least one flag is set for a region, that means the data may read
> > as "not zero".
> > 
> > The protocol does not define what it means to have multiple levels of
> > snapshots, other than:
> > 
> > - Any write command (WRITE or WRITE_ZEROES) MUST NOT clear or set the
> >   Xth level flag if the Yth level flag is not also cleared at the same
> >   time, for any Y > X
> > - Any write (as above) MAY set or clear multiple levels of flags at the
> >   same time, as long as the above holds
> > 
> > Having a 32-bit snapshot status field allows for 32 levels of snapshots.
> > We could switch length and flags to 64 bits so that things continue to
> > align nicely, and then we have a maximum of 64 levels of snapshots.
> 
> 64 bits may not lay out as nicely (a 12-byte struct is not as efficient
> to copy between the wire and a C array as a 8-byte struct).

Which is why I said to switch *both* fields to 64 bits, so that it
becomes 16 bytes rather than 12 :-)
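The layout point being made here can be checked directly. The sketch below (illustrative; the struct names are hypothetical) contrasts the wire sizes with what a natural C struct would occupy in memory: a 64-bit length plus 32-bit status is 12 bytes on the wire, but the in-memory form typically pads to 16 on 64-bit ABIs, so wire and memory layouts no longer match; widening both fields to 64 bits restores the match.

```python
import ctypes
import struct

class Desc64_32(ctypes.Structure):   # 64-bit length, 32-bit status
    _fields_ = [("length", ctypes.c_uint64), ("status", ctypes.c_uint32)]

class Desc64_64(ctypes.Structure):   # both fields widened to 64 bits
    _fields_ = [("length", ctypes.c_uint64), ("status", ctypes.c_uint64)]

wire_12 = struct.calcsize(">QI")     # 12-byte wire descriptor
wire_16 = struct.calcsize(">QQ")     # 16-byte wire descriptor
mem_12 = ctypes.sizeof(Desc64_32)    # padded per the platform ABI
mem_16 = ctypes.sizeof(Desc64_64)    # matches the wire exactly
```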

> > (I'm not going to write this up formally at this time of night, but you
> > get the general idea)
> 
> The idea may make it possible to expose dirty information as a layer of
> depth (from the qemu perspective, each qcow2 file would occupy 2 layers
> of depth: one if dirty, and another if allocated; then deeper layers are
> determined by backing files).  But I'm also worried that it may be more
> complicated than the original question at hand (qemu wants to know, in
> advance of a read, which portions of a file are worth reading, because
> they are either allocated, or because they are dirty; but doesn't care
> to what depth the server has to go to actually perform the reads).

Re: [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-04-05 Thread Alex Bligh

On 5 Apr 2016, at 00:08, Wouter Verhelst  wrote:

> 
> if bit 0 in the latter field is set, that means the block is allocated
>  on the original device
> if bit 1 is set, that means the block is allocated on the first-level
>  snapshot
> if bit 2 is set, that means the block is allocated on the second-level
>  snapshot

This is what I originally thought Eric was proposing. The issue
is that in common servers the 'dirtiness' is at a different
allocation / blocksize from the allocation status. So you
really want 2 different bitmaps.

-- 
Alex Bligh







Re: [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-04-04 Thread Eric Blake
On 04/04/2016 05:08 PM, Wouter Verhelst wrote:
> On Mon, Apr 04, 2016 at 10:54:02PM +0300, Denis V. Lunev wrote:
>> Speaking of dirtiness, we would soon come to the fact that
>> we can have several dirtiness states regarding different
>> lines of incremental backups. This complexity is hidden
>> inside QEMU and it would be very difficult to publish and
>> reuse it.
> 
> How about this then.
> 
> A reply to GET_BLOCK_STATUS containing chunks of this:
> 
> 32-bit length
> 32-bit "snapshot status"
> if bit 0 in the latter field is set, that means the block is allocated
>   on the original device
> if bit 1 is set, that means the block is allocated on the first-level
>   snapshot
> if bit 2 is set, that means the block is allocated on the second-level
>   snapshot

The idea of allocation is orthogonal from the idea of reads as zeroes.
That is, a client may usefully guarantee that something reads as zeroes,
whether or not it is allocated (but knowing whether it is a hole or
allocated will determine whether future writes to that area will cause
file system fragmentation or be at risk of ENOSPC on thin-provisioning).
 If we want to expose the notion of depth (and I'm not sure about that
yet), we may want to reserve bit 0 for 'reads as zero' and bits 1-30 as
'allocated at depth "bit-1"' (and bit 31 as 'allocated at depth 30 or
greater').

I don't know if the idea of depth of allocation is useful enough to
expose in this manner; qemu could certainly advertise depth if the
protocol calls it out, but I'm still not sure whether knowing depth
helps any algorithms.

> 
> If all flags are cleared, that means the block is not allocated (i.e.,
> is a hole) and MUST read as zeroes.

That's too strong.  NBD_CMD_TRIM says that we can create holes whose
data does not necessarily read as zeroes (and SCSI definitely has
semantics like this - not all devices guarantee zero reads when you
UNMAP; and WRITE_SAME has an UNMAP flag to control whether you are okay
with the faster unmapping operation at the expense of bad reads, or
slower explicit writes).  Hence my complaint that we have to treat
'reads as zero' as an orthogonal bit to 'allocated at depth X'.

> 
> If a flag is set at a particular level X, that means the device is dirty
> at the Xth-level snapshot.
> 
> If at least one flag is set for a region, that means the data may read
> as "not zero".
> 
> The protocol does not define what it means to have multiple levels of
> snapshots, other than:
> 
> - Any write command (WRITE or WRITE_ZEROES) MUST NOT clear or set the
>   Xth level flag if the Yth level flag is not also cleared at the same
>   time, for any Y > X
> - Any write (as above) MAY set or clear multiple levels of flags at the
>   same time, as long as the above holds
> 
> Having a 32-bit snapshot status field allows for 32 levels of snapshots.
> We could switch length and flags to 64 bits so that things continue to
> align nicely, and then we have a maximum of 64 levels of snapshots.

64 bits may not lay out as nicely (a 12-byte struct is not as efficient
to copy between the wire and a C array as a 8-byte struct).

> 
> (I'm not going to write this up formally at this time of night, but you
> get the general idea)

The idea may make it possible to expose dirty information as a layer of
depth (from the qemu perspective, each qcow2 file would occupy 2 layers
of depth: one if dirty, and another if allocated; then deeper layers are
determined by backing files).  But I'm also worried that it may be more
complicated than the original question at hand (qemu wants to know,  in
advance of a read, which portions of a file are worth reading, because
they are either allocated, or because they are dirty; but doesn't care
to what depth the server has to go to actually perform the reads).

-- 
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-04-04 Thread Wouter Verhelst
On Mon, Apr 04, 2016 at 10:54:02PM +0300, Denis V. Lunev wrote:
> Speaking of dirtiness, we would soon come to the fact that
> we can have several dirtiness states regarding different
> lines of incremental backups. This complexity is hidden
> inside QEMU and it would be very difficult to publish and
> reuse it.

How about this then.

A reply to GET_BLOCK_STATUS containing chunks of this:

32-bit length
32-bit "snapshot status"
if bit 0 in the latter field is set, that means the block is allocated
  on the original device
if bit 1 is set, that means the block is allocated on the first-level
  snapshot
if bit 2 is set, that means the block is allocated on the second-level
  snapshot

etc

If all flags are cleared, that means the block is not allocated (i.e.,
is a hole) and MUST read as zeroes.

If a flag is set at a particular level X, that means the device is dirty
at the Xth-level snapshot.

If at least one flag is set for a region, that means the data may read
as "not zero".

The protocol does not define what it means to have multiple levels of
snapshots, other than:

- Any write command (WRITE or WRITE_ZEROES) MUST NOT clear or set the
  Xth level flag if the Yth level flag is not also cleared at the same
  time, for any Y > X
- Any write (as above) MAY set or clear multiple levels of flags at the
  same time, as long as the above holds

Having a 32-bit snapshot status field allows for 32 levels of snapshots.
We could switch length and flags to 64 bits so that things continue to
align nicely, and then we have a maximum of 64 levels of snapshots.

(I'm not going to write this up formally at this time of night, but you
get the general idea)
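
Informally, a client could decode such a reply like this (a sketch of
the idea above; 32-bit big-endian fields are an assumption, and status
bit X stands for snapshot level X):

```python
import struct

def parse_snapshot_status(payload):
    """Decode a hypothetical GET_BLOCK_STATUS reply made of chunks of
    (32-bit length, 32-bit snapshot status), both big-endian, as
    sketched above.  Returns (length, levels) pairs where `levels` is
    the set of snapshot levels whose flag bit is set; an empty set
    means the block is a hole that must read as zeroes."""
    assert len(payload) % 8 == 0, "payload must be whole 8-byte chunks"
    descriptors = []
    for off in range(0, len(payload), 8):
        length, status = struct.unpack(">II", payload[off:off + 8])
        levels = {bit for bit in range(32) if status & (1 << bit)}
        descriptors.append((length, levels))
    return descriptors
```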

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
   people in the world who think they really understand all of its rules,
   and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12



Re: [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-04-04 Thread Eric Blake
On 04/04/2016 04:40 PM, Wouter Verhelst wrote:
> Hi,
> 
> Need to look into this in some detail, for which I don't have the time
> (or the non-tiredness ;-) right now, but these two caught my eye:
> 

>> +The payload is structured as a list of one or more descriptors,
>> +each with this layout:
>> +
>> +* 32 bits, length (unsigned, MUST NOT be zero)
>> +* 32 bits, status flags
>> +
>> +The definition of the status flags is determined based on the
>> +flags present in the original request.
> 
> Might be a good idea to specify what a client can do with flags it
> doesn't know about; ignore them, probably?

Sounds correct - so a server can subdivide into more descriptors with
additional status bits, and clients just have to ignore those extra bits
and coalesce things themselves when dealing with the bits they care about.
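
That client behaviour (mask off unknown status bits, then merge adjacent
descriptors that agree on the remaining bits) can be sketched as:

```python
def coalesce(descriptors, known_mask):
    """Coalesce consecutive (length, flags) descriptors after masking
    off status bits this client does not understand.  A sketch of the
    'ignore unknown flags' behaviour discussed above."""
    merged = []
    for length, flags in descriptors:
        flags &= known_mask  # ignore bits we don't know about
        if merged and merged[-1][1] == flags:
            # same effective status as the previous run: extend it
            merged[-1] = (merged[-1][0] + length, flags)
        else:
            merged.append((length, flags))
    return merged
```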

>  
> [...]
>> +The extension adds the following new command flag:
>> +
>> +- `NBD_CMD_FLAG_STATUS_DIRTY`; valid during `NBD_CMD_BLOCK_STATUS`.
>> +  SHOULD be set to 1 if the client wants to request dirtiness status
>> +  rather than provisioning status.
> 
> Why only one flag here? I could imagine a client might want to query for
> both states at the same time. Obviously that means a client needs to
> query for *at least* one of the statuses, otherwise the server should
> reply with EINVAL.
> 
> Though I'm undecided on whether a bit being set to 0 should mean "give
> me that information" or whether 1 should.

Hmm. Based on that reading, maybe we want TWO flags:

NBD_CMD_FLAG_STATUS_ALLOCATED
NBD_CMD_FLAG_STATUS_DIRTY

and require that the client MUST set at least one flag (setting no flags
at all is boring), but similarly allow that the server MAY reject
attempts to set multiple flags with EINVAL (if there is no efficient way
to provide the information for all flags on a single pass), in which
case clients have to fall back to querying one flag at a time.
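
A server-side sketch of that rule might look like the following (the
flag bit positions are hypothetical, since the draft has not yet
reserved a bit even for NBD_CMD_FLAG_STATUS_DIRTY):

```python
# Hypothetical bit assignments, for illustration only.
NBD_CMD_FLAG_STATUS_ALLOCATED = 1 << 3
NBD_CMD_FLAG_STATUS_DIRTY = 1 << 4
EINVAL = 22

def validate_status_flags(flags, can_combine=False):
    """Sketch of the proposed rule: the client MUST set at least one
    status flag, and a server MAY reject a combined query with EINVAL
    if it cannot answer both in a single pass.  Returns 0 or errno."""
    both = NBD_CMD_FLAG_STATUS_ALLOCATED | NBD_CMD_FLAG_STATUS_DIRTY
    wanted = flags & both
    if wanted == 0:
        return EINVAL  # no status requested at all: boring, reject
    if wanted == both and not can_combine:
        return EINVAL  # server can't answer both kinds in one pass
    return 0
```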

But while Alex and Denis were arguing that no one would ever query both
things at once (and therefore, it might be better to make
NBD_STATE_HOLE and NBD_STATE_CLEAN both be bit 0), your approach of
having two separate request flags and allowing both at once would mean
we do need to keep the status information separate (NBD_STATE_HOLE is
bit 0, NBD_STATE_CLEAN is bit 2).

I also see that I failed to reserve a bit for NBD_CMD_FLAG_STATUS_DIRTY.

Looks like there will still be some more conversation and at least a v3
needed, but I'll wait a couple days for that to happen so that more
reviewers can chime in, without being too tired during the review process.

-- 
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-04-04 Thread Wouter Verhelst
Hi,

Need to look into this in some detail, for which I don't have the time
(or the non-tiredness ;-) right now, but these two caught my eye:

On Mon, Apr 04, 2016 at 10:39:10AM -0600, Eric Blake wrote:
[...]
> +* `NBD_REPLY_TYPE_BLOCK_STATUS`
> +
> +*length* MUST be a positive integer multiple of 8.  This reply
> +represents a series of consecutive block descriptors where the sum
> +of the lengths of the descriptors equals the length of the
> +original request.  This chunk type MUST appear at most once in a
> +structured reply. Valid as a reply to `NBD_CMD_BLOCK_STATUS`.
> +
> +The payload is structured as a list of one or more descriptors,
> +each with this layout:
> +
> +* 32 bits, length (unsigned, MUST NOT be zero)
> +* 32 bits, status flags
> +
> +The definition of the status flags is determined based on the
> +flags present in the original request.

Might be a good idea to specify what a client can do with flags it
doesn't know about; ignore them, probably?
 
[...]
> +The extension adds the following new command flag:
> +
> +- `NBD_CMD_FLAG_STATUS_DIRTY`; valid during `NBD_CMD_BLOCK_STATUS`.
> +  SHOULD be set to 1 if the client wants to request dirtiness status
> +  rather than provisioning status.

Why only one flag here? I could imagine a client might want to query for
both states at the same time. Obviously that means a client needs to
query for *at least* one of the statuses, otherwise the server should
reply with EINVAL.

Though I'm undecided on whether a bit being set to 0 should mean "give
me that information" or whether 1 should.

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
   people in the world who think they really understand all of its rules,
   and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12



Re: [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-04-04 Thread Alex Bligh

On 4 Apr 2016, at 22:25, Eric Blake  wrote:

> On 04/04/2016 03:07 PM, Alex Bligh wrote:
>> 
>> On 4 Apr 2016, at 21:26, Eric Blake  wrote:
>> 
>>> Huh? The current proposal _requires_ these to be separate queries.  You
>>> cannot query dirtiness at the same time as allocation, because the value
>>> of NBD_CMD_FLAG_DIRTY is distinct between the two queries.
>> 
>> OK, so you can ask for either (1) or (2), but not both. I see. I misread
>> that. I still think Denis's idea of selecting a bitmap to query works better.
> 
> I'm still not sure I follow what you are proposing in 'selecting a
> bitmap', at least not without also needing to expand the protocol to
> allow for a structured command that has more than a fixed-length message
> size (currently all commands except NBD_CMD_WRITE) or a self-described
> size (NBD_CMD_WRITE).

Whether it's an integer, or a UTF-8 string or whatever, we'd need
a way to expand the command structure.

An easy way to do that would be a command flag 'NBD_CMD_FLAG_EXTENDED_COMMAND'
which means the command was immediately followed by

0000   32-bit: command extra-length
0004   variable length data
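
A sketch of that framing idea (field width and byte order are my
assumptions, not part of any draft):

```python
import struct

def frame_extended_command(base_header, extra_data):
    """Frame a command per the NBD_CMD_FLAG_EXTENDED_COMMAND idea
    sketched above: the ordinary fixed-size command header, then a
    32-bit big-endian extra-length field, then that many bytes of
    variable data."""
    return base_header + struct.pack(">I", len(extra_data)) + extra_data
```

A wire sniffer that understands the flag could then skip extra-length
bytes without knowing the extended command's semantics.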

>  Would this bitmap id be specified as an integer
> (32 bits?), as an arbitrary UTF-8 string, or as something else?

Or a 128 bit unique identifier (i.e. a packed UUID) to identify a bitmap.
That would obviate the need for a registry of such things.

That or the UTF-8 string

>  And
> since we _already_ documented that querying dirty information requires
> out-of-band coordination, why can't the out-of-band communication convey
> the bitmap selection, so that the NBD command then just reads the dirty
> status of the already-selected bitmap?

I suppose I was trying to make a single 'read bitmap' command, that
read an arbitrary bitmap. We then define one 'well known' bitmap
as 'allocation'. You can have your qemu bitmap(s) to do whatever
you want.

--
Alex Bligh








Re: [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-04-04 Thread Denis V. Lunev
On 04/05/2016 12:17 AM, Eric Blake wrote:
> On 04/04/2016 03:04 PM, Denis V. Lunev wrote:
>> In v1 we have had 'status' field, which can have the
>> following values for dirty request:
>>
>> +  - `NBD_STATE_DIRTY` (0x0), LBA extent is dirty;
>> +  - `NBD_STATE_CLEAN` (0x1), LBA extent is clean.
>>
>> in the extent structure:
>>
>> +* 64 bits, offset (unsigned)
>> +* 32 bits, length (unsigned)
>> +* 16 bits, status (unsigned)
> Between v1 and v2, we dropped 64-bit offset (offset is now implied, by
> adding lengths of all earlier descriptors), and widened status from 16
> bits to 32 bits (so that each descriptor is now naturally 8-byte aligned
> and therefore easier to make a C array).
>
>> with an additional NBD_STATE_DIRTY_HOLE (or DIRTY_DEALLOCATED)
>> we could report the entire state using one query. The user would be
>> able to read the entire state, which is useful for backup, at once.
>>
>> Your current proposal is more tricky and it was misunderstood by Alex:
>>
>> +* 32 bits, status flags
>>
>> and you describe flags as
>>
>> +  - `NBD_STATE_HOLE` (bit 0); if set, the block represents a hole
>> +  - `NBD_STATE_ZERO` (bit 1), if set, the block contents read as
>> +  - `NBD_STATE_CLEAN` (bit 2); if set, the block represents a
>> and opinion of Alex was that all 3 bits could be set in reply to
>> NBD_CMD_BLOCK_STATUS
>> with NBD_CMD_FLAG_STATUS_DIRTY set.
>>
>> This confused him. This confuses me too.
> There's nothing that says that NBD_STATE_CLEAN can't be reassigned to
> bit 0.  Conversely, we may want to add a future NBD_CMD_FLAG_FOO that
> lets you read allocation and dirty information in the same call, in
> which case having the bits be distinct will make that easier; but where
> we would also make it obvious that the server is allowed to reject that
> command flag as unsupported (we already state the server can reject
> NBD_CMD_FLAG_STATUS_DIRTY with EINVAL as unsupported; and that if it
> does not reject a dirtiness request but cannot otherwise report
> anything, then the entire region is reported as dirty).
>
> I don't have any strong opinions on whether NBD_STATE_CLEAN should
> remain bit 2 or be renumbered to bit 0, although renumbering it to bit 0
> would make it painfully obvious that you cannot query allocation and
> dirtiness at the same time.
I think it is worth doing, to avoid this type of confusion.

>> If allocated state is not returned for a command with NBD_CMD_FLAG_STATUS_DIRTY,
>> then why have different meanings for the bits?
> Because we still have room - no need to overlap the meaning of bits as
> long as we have more bits to choose from.
>
ok



Re: [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-04-04 Thread Eric Blake
On 04/04/2016 03:07 PM, Alex Bligh wrote:
> 
> On 4 Apr 2016, at 21:26, Eric Blake  wrote:
> 
>> Huh? The current proposal _requires_ these to be separate queries.  You
>> cannot query dirtiness at the same time as allocation, because the value
>> of NBD_CMD_FLAG_DIRTY is distinct between the two queries.
> 
> OK, so you can ask for either (1) or (2), but not both. I see. I misread
> that. I still think Denis's idea of selecting a bitmap to query works better.

I'm still not sure I follow what you are proposing in 'selecting a
bitmap', at least not without also needing to expand the protocol to
allow for a structured command that has more than a fixed-length message
size (currently all commands except NBD_CMD_WRITE) or a self-described
size (NBD_CMD_WRITE).  Would this bitmap id be specified as an integer
(32 bits?), as an arbitrary UTF-8 string, or as something else?  And
since we _already_ documented that querying dirty information requires
out-of-band coordination, why can't the out-of-band communication convey
the bitmap selection, so that the NBD command then just reads the dirty
status of the already-selected bitmap?

It was Paolo's suggestion to document that querying dirty status is only
useful with out-of-band coordination, at which point, why bloat the NBD
protocol with details that are better left to that external
coordination?  Wouter had a valid complaint that v1 was unspecified (it
said that being "dirty" was implementation defined, but gave no other
meaning and a server could use random numbers and still be compliant);
v2 picked up Paolo's wording suggestion against v1 that tries to make it
obvious that being "dirty" is still implementation defined, but that the
definition is whatever it takes for a coordination with out-of-band
information to provide useful results.

-- 
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-04-04 Thread Eric Blake
On 04/04/2016 03:04 PM, Denis V. Lunev wrote:
> In v1 we have had 'status' field, which can have the
> following values for dirty request:
> 
> +  - `NBD_STATE_DIRTY` (0x0), LBA extent is dirty;
> +  - `NBD_STATE_CLEAN` (0x1), LBA extent is clean.
> 
> in the extent structure:
> 
> +* 64 bits, offset (unsigned)
> +* 32 bits, length (unsigned)
> +* 16 bits, status (unsigned)

Between v1 and v2, we dropped 64-bit offset (offset is now implied, by
adding lengths of all earlier descriptors), and widened status from 16
bits to 32 bits (so that each descriptor is now naturally 8-byte aligned
and therefore easier to make a C array).
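
The resulting v2 descriptor layout can be sketched as a packing helper
(a sketch only; no validation of the run lengths is attempted):

```python
import struct

# v2 descriptor: 32-bit length + 32-bit status flags, big-endian.
# Each descriptor is 8 bytes, so an array of them stays naturally
# 8-byte aligned, matching the rationale above.
DESCRIPTOR = struct.Struct(">II")

def pack_descriptors(runs):
    """Pack (length, status) runs into an NBD_REPLY_TYPE_BLOCK_STATUS
    payload as described above."""
    return b"".join(DESCRIPTOR.pack(length, status)
                    for length, status in runs)
```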

> 
> with an additional NBD_STATE_DIRTY_HOLE (or DIRTY_DEALLOCATED)
> we could report the entire state using one query. The user would be
> able to read the entire state, which is useful for backup, at once.
> 
> Your current proposal is more tricky and it was misunderstood by Alex:
> 
> +* 32 bits, status flags
> 
> and you describe flags as
> 
> +  - `NBD_STATE_HOLE` (bit 0); if set, the block represents a hole

> +  - `NBD_STATE_ZERO` (bit 1), if set, the block contents read as

> 
> +  - `NBD_STATE_CLEAN` (bit 2); if set, the block represents a

> and opinion of Alex was that all 3 bits could be set in reply to
> NBD_CMD_BLOCK_STATUS
> with NBD_CMD_FLAG_STATUS_DIRTY set.
> 
> This confused him. This confuses me too.

There's nothing that says that NBD_STATE_CLEAN can't be reassigned to
bit 0.  Conversely, we may want to add a future NBD_CMD_FLAG_FOO that
lets you read allocation and dirty information in the same call, in
which case having the bits be distinct will make that easier; but where
we would also make it obvious that the server is allowed to reject that
command flag as unsupported (we already state the server can reject
NBD_CMD_FLAG_STATUS_DIRTY with EINVAL as unsupported; and that if it
does not reject a dirtiness request but cannot otherwise report
anything, then the entire region is reported as dirty).

I don't have any strong opinions on whether NBD_STATE_CLEAN should
remain bit 2 or be renumbered to bit 0, although renumbering it to bit 0
would make it painfully obvious that you cannot query allocation and
dirtiness at the same time.

> 
> If allocated state is not returned for a command with NBD_CMD_FLAG_STATUS_DIRTY,
> then why have different meanings for the bits?

Because we still have room - no need to overlap the meaning of bits as
long as we have more bits to choose from.

-- 
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-04-04 Thread Alex Bligh

On 4 Apr 2016, at 21:34, Eric Blake  wrote:

> The original design abused the 16-bit 'flags' field of each command to
> instead try and treat that value as a bitmap number, instead of a
> bitwise-or'd set of flags.  That was one of the complaints against v1,
> and was fixed in v2 by having a single boolean flag, NBD_CMD_FLAG_DIRTY,
> which was off for (default) allocation queries, and set for dirtiness
> queries.  We can add other flags for any other type of queries, and the
> principle of each query being a run-length-encoded listing still applies.

Well, abusing flags is pretty gross.

You're multiplexing on a single flag and I didn't like that.

So I suggested multiplexing on the command and you didn't like that.

Eventually as you point out we really want to expand the command field,
and ...

> As it is, we don't have structured writes - right now, you can write a

... I agree this is almost inventing 'structured writes' :-)

--
Alex Bligh








Re: [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-04-04 Thread Alex Bligh

On 4 Apr 2016, at 21:26, Eric Blake  wrote:

> Huh? The current proposal _requires_ these to be separate queries.  You
> cannot query dirtiness at the same time as allocation, because the value
> of NBD_CMD_FLAG_DIRTY is distinct between the two queries.

OK, so you can ask for either (1) or (2), but not both. I see. I misread
that. I still think Denis's idea of selecting a bitmap to query works better.

--
Alex Bligh








Re: [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-04-04 Thread Denis V. Lunev
On 04/04/2016 11:34 PM, Eric Blake wrote:
> On 04/04/2016 02:08 PM, Denis V. Lunev wrote:
>>> This again makes me think this should be a different
>>> command from something which is obviously useful and
>>> comprehensible to more than one server/client (i.e.
>>> allocation).
>>>
>> The original design of this command used a 16-bit number
>> to specify the NUMBER of the bitmap which was
>> exported by the server.
> The original design abused the 16-bit 'flags' field of each command to
> instead try and treat that value as a bitmap number, instead of a
> bitwise-or'd set of flags.  That was one of the complaints against v1,
> and was fixed in v2 by having a single boolean flag, NBD_CMD_FLAG_DIRTY,
> which was off for (default) allocation queries, and set for dirtiness
> queries.  We can add other flags for any other type of queries, and the
> principle of each query being a run-length-encoded listing still applies.
>
>> We have reserved number 0 for 'used' bitmap, i.e.
>> bitmap of allocated blocks and number 1 for 'dirty'
>> bitmap. Though we can skip specification of the
>> belonging of any number except '0' and put them
>> to server-client negotiations. Or we could reserve
>> '1' for dirtiness state as server-client agrees and
>> allow other applications to register their own bitmaps
>> as they deserve to.
>>
>> Why not do things the original way?
> If you want to encode particular ids, you should do so in a separate
> field, and not overload the 'flags' field.
>
> As it is, we don't have structured writes - right now, you can write a
> wire sniffer for the client side, where all commands except
> NBD_CMD_WRITE are fixed size, and NBD_CMD_WRITE describes its own size
> via its length field; the extension NBD_CMD_WRITE_ZEROES still fits into
> this scheme.  All NBD implementations have to supply NBD_CMD_WRITE, but
> any extension commands do NOT have to be universal.  Writing a wire
> sniffer that special-cases NBD_CMD_WRITE is easy (since that command
> will always exist), but writing a wire sniffer that special-cases
> arbitrary commands, particularly where those arbitrary commands do not
> also self-describe the length of the command, is hard.  We can't
> overload the flags field to say which bitmap id to grab, but we also
> can't arbitrarily add 4 bytes to the command size when the command is
> NBD_CMD_BLOCK_STATUS (because wire sniffers that don't know about
> NBD_CMD_BLOCK_STATUS wouldn't know to expect those four bytes to be part
> of the current packet rather than starting a new packet).
>
> The recent work on structured reads made it possible for an arbitrary
> wire sniffer to gracefully skip over the variable-length return size
> reply to NBD_CMD_BLOCK_STATUS, and any other extension command that we
> might add later.  But right now, I'm not seeing a compelling reason to
> add structured commands to the NBD protocol.
>
thank you for pointing this out. I think I got an idea



Re: [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-04-04 Thread Eric Blake
On 04/04/2016 02:27 PM, Denis V. Lunev wrote:
> On 04/04/2016 11:15 PM, Alex Bligh wrote:
>> On 4 Apr 2016, at 21:13, Denis V. Lunev  wrote:
>>
>>> As far as I remember, in that text we had a number in the request
>>> specifying which bitmap to query, and the server should reply with one
>>> bitmap at a time.
>>>
>>> Would this work for you?
>> I think that would be much better, yes, though I'd suggest the
>> bitmap had an ID other than a number 0..15 so other people
>> can use it too.
>>
> a bitmap requires negotiating granularity, which is
> not an easy thing.
> 
> If we have different granularities for 'dirty' and 'allocated'
> bitmaps, we can report this using this protocol, but
> cannot do so cleanly with the bitmap approach, assuming
> that we add an 'NBD_STATE_DIRTY_DEALLOCATED' (0x2) state.

I'm not sure what you are trying to propose adding here.  'state' is a
bitmap - it is either a representation of two bits of information
(NBD_CMD_FLAG_DIRTY was clear, so state is the bitwise OR of
NBD_STATE_HOLE and NBD_STATE_ZERO), or the representation of one bit of
information (NBD_CMD_FLAG_DIRTY was set, so state is the bitwise OR of
NBD_STATE_CLEAN).

I'm not sure where you are reading into this that granularity has to be
negotiated.  The client never mentions granularity; and the server is
perfectly free to report data in descriptors as large or as small as it
wants (although I did document that the server SHOULD stick to
descriptors that are at least 512 bytes at a time, and SHOULD coalesce
descriptors so that two consecutive descriptors have distinct state values).

Whether something is allocated or not has no direct bearing on whether
it is dirty or not; and it is feasible that a server could report the
act of NBD_CMD_TRIM as causing a portion of the file to become dirty.

-- 
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-04-04 Thread Denis V. Lunev
On 04/04/2016 11:08 PM, Alex Bligh wrote:
> On 4 Apr 2016, at 21:04, Denis V. Lunev  wrote:
>
>>> Sure, but given you can't report dirtiness without also reporting
>>> allocation, if they are at different blocksizes I'd rather they
>>> were in different commands, because otherwise the code to report
>>> block size needs to work at two different granularities.
>>>
>> 'dirty' could come after the data was 'trimmed' from the region!
>> Thus we could have dirty unallocated data.
> Let me rephrase.
>
> Under the current proposal it is not possible to report whether or
> not a region is dirty without also reporting whether or not it
> is allocated. As these two concepts exist at potentially
> different block sizes, the code to support reporting on allocation
> must now be able to run both at the allocation blocksize and
> the dirtiness blocksize, which is going to be a pain.
>
> If these were two different commands, they could each run at their
> natural block size.
>
Could you look into V1 of this?

As far as I remember, in that text we had a number in the request
specifying which bitmap to query, and the server should reply with one
bitmap at a time.

Would this work for you?



Re: [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-04-04 Thread Denis V. Lunev
On 04/04/2016 11:15 PM, Alex Bligh wrote:
> On 4 Apr 2016, at 21:13, Denis V. Lunev  wrote:
>
>> As far as I remember, in that text we had a number in the request
>> specifying which bitmap to query, and the server should reply with one
>> bitmap at a time.
>>
>> Would this work for you?
> I think that would be much better, yes, though I'd suggest the
> bitmap had an ID other than a number 0..15 so other people
> can use it too.
>
A bitmap requires negotiating granularity, which is
not an easy thing.

If we have different granularities for 'dirty' and 'allocated'
bitmaps, we can report this using this protocol, but
cannot do so cleanly with the bitmap approach, assuming
that we add an 'NBD_STATE_DIRTY_DEALLOCATED' (0x2) state.



Re: [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-04-04 Thread Eric Blake
On 04/04/2016 01:58 PM, Alex Bligh wrote:
> Eric,
> 
>> Nothing requires the two uses to report at the same granularity.  The
>> NBD_REPLY_TYPE_BLOCK_STATUS allows the server to divide into descriptors
>> as it sees fit (so it could report holes at a 4k granularity, but
>> dirtiness only at a 64k granularity) - all that matters is that when all
>> the descriptors have been sent, they total up to the length of the
>> original client request.  So by itself, granularity does not require
>> another command.
> 
> Sure, but given you can't report dirtiness without also reporting
> allocation, if they are at different blocksizes I'd rather they
> were in different commands, because otherwise the code to report
> block size needs to work at two different granularities.

We already state the client has to make two queries.  The client is NOT
querying block size, but a map of contiguous bytes of the file with the
same properties.

As an example usage,

C: NBD_CMD_BLOCK_STATUS: length 128k, offset 0, flags 0
S: NBD_REPLY_TYPE_BLOCK_STATUS: length 32 (4 descriptors):
 length 16384, status 0
 length 16384, status NBD_STATE_HOLE|NBD_STATE_ZERO
 length 81920, status 0
 length 16384, status NBD_STATE_ZERO
C: NBD_CMD_BLOCK_STATUS: length 128k, offset 0, flags NBD_CMD_FLAG_DIRTY
S: NBD_REPLY_TYPE_BLOCK_STATUS: length 16 (2 descriptors):
 length 65536, status 0
 length 65536, status NBD_STATE_CLEAN

would be a perfectly valid sequence of replies.  It tells the client
nothing about the guest block size (that information would be the same
whether the guest uses 512-byte, 1024-byte, or 4096-byte sectors?).  In
fact, you can't even tell if it is possible to track dirty information
in a granularity smaller than 64k, only that the results of this command
did not have anything smaller than that.  The command intentionally
separates modes of operation (you can't query allocation and
NBD_CMD_FLAG_DIRTY at the same time), in case the server has different
block size code under the hood for the two types of queries.
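
To make the implied offsets concrete, a client could expand the
run-length-encoded descriptors from the example above into explicit
extents (a sketch; the state bit values follow the v2 draft):

```python
NBD_STATE_HOLE = 1 << 0
NBD_STATE_ZERO = 1 << 1

def descriptors_to_extents(request_offset, descriptors):
    """Turn run-length-encoded (length, status) descriptors into
    explicit (offset, length, status) extents; each descriptor's
    offset is implied by the sum of all earlier lengths."""
    extents = []
    offset = request_offset
    for length, status in descriptors:
        extents.append((offset, length, status))
        offset += length
    return extents
```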

> 
>> At this point, I was just trying to rebase the proposal as originally
>> made by Denis and Pavel; perhaps they will have more insight on how they
>> envisioned using the command, or on whether we should try harder to make
>> this more of a two-way protocol (where the client can tell the server
>> when to mark something as clean, or when to start tracking whether
>> something is dirty).
> 
> I'm not sure it needs to be a two way protocol (see my strawman) but
> I'd like to know it's at least possibly useful.

Tracking generation ids on every sector is one implementation, but it is
not currently used by qemu for qcow2 images.  So forcing the
implementation to be exposed by having NBD track dirty information by
generation id would require qemu to start tracking new information that
it does not currently have.  Qemu's current implementation of dirty
information is a bitmap with no generation id, so the idea is that the
NBD command for exposing dirty sectors should likewise be no more
specific than a listing of run-length-encoded alternations between "this
part of the file is clean" and "this part of the file is dirty".

Even for an implementation that DOES track generation id on every
sector, you still have the case that you need out-of-band communication
to make use of it.  So the client would first send an out-of-band
message "set generation 123 as my line in the sand", then use NBD to
query dirty sectors (which returns clean for all sectors with id < 123,
and dirty for all sectors >= 123).
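
A minimal sketch of that hypothetical generation-id scheme (the
per-sector data layout is an assumption for illustration, not anything
qemu implements):

```python
def dirty_map(sector_generations, baseline):
    """Report a sector as dirty iff its last-write generation id is
    >= the out-of-band 'line in the sand' baseline, per the scheme
    described above."""
    return [gen >= baseline for gen in sector_generations]
```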

-- 
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-04-04 Thread Denis V. Lunev
On 04/04/2016 11:03 PM, Alex Bligh wrote:
> On 4 Apr 2016, at 20:54, Denis V. Lunev  wrote:
>
>> for now and for QEMU we want this to expose accumulated dirtiness
>> of the block device, which is collected by the server. Yes, this requires
>> external coordination. Maybe this COULD be part of the protocol,
>> but QEMU will not use that part of the protocol.
>>
>> saying about dirtiness, we would soon come to the fact, that
>> we can have several dirtiness states regarding different
>> lines of incremental backups. This complexity is hidden
>> inside QEMU and it would be very difficult to publish and
>> reuse it.
> So qemu-nbdserver has some idea of 'dirtiness', and you want
> to publish that, and the consumer is qemu (and only qemu),
> because only qemu knows what 'dirtiness' means in the
> sense the server provides it?
>
> I've nothing against helping qemu out here (having
> contributed to qemu myself), but perhaps it might be better then,
> rather than call it 'dirtiness' to make it some named attribute
> for private client/server communication.
>
> This again makes me think this should be a different
> command from something which is obviously useful and
> comprehensible to more than one server/client (i.e.
> allocation).
>
The original design of this command used a 16-bit number
to specify the NUMBER of the bitmap which was
exported by the server.

We reserved number 0 for the 'used' bitmap, i.e. the
bitmap of allocated blocks, and number 1 for the 'dirty'
bitmap. Though we could leave the assignment of any
number except '0' to server-client negotiation. Or we
could reserve '1' for dirtiness state, as server and
client agree, and allow other applications to register
their own bitmaps as they deserve to.

Why not do things the original way?

Den



Re: [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-04-04 Thread Alex Bligh

On 4 Apr 2016, at 21:04, Denis V. Lunev  wrote:

>> Sure, but given you can't report dirtiness without also reporting
>> allocation, if they are at different blocksizes I'd rather they
>> were in different commands, because otherwise the code to report
>> block size needs to work at two different granularities.
>> 
> 'dirty' could come after the data was 'trimmed' from the region!
> Thus we could have dirty unallocated data.

Let me rephrase.

Under the current proposal it is not possible to report whether or
not a region is dirty without also reporting whether or not it
is allocated. As these two concepts exist at potentially
different block sizes, the code to support reporting on allocation
must now be able to run both at the allocation blocksize and
the dirtiness blocksize, which is going to be a pain.

If these were two different commands, they could each run at their
natural block size.

-- 
Alex Bligh







Re: [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-04-04 Thread Denis V. Lunev
On 04/04/2016 10:58 PM, Alex Bligh wrote:
> Eric,
>
>> Nothing requires the two uses to report at the same granularity.  The
>> NBD_REPLY_TYPE_BLOCK_STATUS allows the server to divide into descriptors
>> as it sees fit (so it could report holes at a 4k granularity, but
>> dirtiness only at a 64k granularity) - all that matters is that when all
>> the descriptors have been sent, they total up to the length of the
>> original client request.  So by itself, granularity does not require
>> another command.
> Sure, but given you can't report dirtiness without also reporting
> allocation, if they are at different blocksizes I'd rather they
> were in different commands, because otherwise the code to report
> block size needs to work at two different granularities.
>
'dirty' could come after the data was 'trimmed' from the region!
thus we could have dirty unallocated data.

>> At this point, I was just trying to rebase the proposal as originally
>> made by Denis and Pavel; perhaps they will have more insight on how they
>> envisioned using the command, or on whether we should try harder to make
>> this more of a two-way protocol (where the client can tell the server
>> when to mark something as clean, or when to start tracking whether
>> something is dirty).
> I'm not sure it needs to be a two way protocol (see my strawman) but
> I'd like to know it's at least possibly useful.
>
> --
> Alex Bligh
>
>
>
>




Re: [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-04-04 Thread Alex Bligh

On 4 Apr 2016, at 20:54, Denis V. Lunev  wrote:

> for now and for QEMU we want this to expose accumulated dirtiness
> of the block device, which is collected by the server. Yes, this requires
> external coordination. Maybe this COULD be part of the protocol,
> but QEMU will not use that part of the protocol.
> 
> Speaking about dirtiness, we would soon come to the fact that
> we can have several dirtiness states regarding different
> lines of incremental backups. This complexity is hidden
> inside QEMU and it would be very difficult to publish and
> reuse it.

So qemu-nbdserver has some idea of 'dirtiness', and you want
to publish that, and the consumer is qemu (and only qemu),
because only qemu knows what 'dirtiness' means in the
sense the server provides it?

I've nothing against helping qemu out here (having
contributed to qemu myself), but perhaps it might be better then,
rather than calling it 'dirtiness', to make it some named attribute
for private client/server communication.

This again makes me think this should be a different
command from something which is obviously useful and
comprehensible to more than one server/client (i.e.
allocation).

-- 
Alex Bligh







Re: [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-04-04 Thread Alex Bligh
Eric,

> Nothing requires the two uses to report at the same granularity.  The
> NBD_REPLY_TYPE_BLOCK_STATUS allows the server to divide into descriptors
> as it sees fit (so it could report holes at a 4k granularity, but
> dirtiness only at a 64k granularity) - all that matters is that when all
> the descriptors have been sent, they total up to the length of the
> original client request.  So by itself, granularity does not require
> another command.

Sure, but given you can't report dirtiness without also reporting
allocation, if they are at different blocksizes I'd rather they
were in different commands, because otherwise the code to report
block size needs to work at two different granularities.
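[Editorial sketch: the rule Eric quotes, that descriptors may use any granularity so long as their lengths total the original request length, can be modelled as a simple check. This is a toy illustration, not real NBD server code.]

```python
# Toy model of the NBD_REPLY_TYPE_BLOCK_STATUS rule discussed above: the
# server may slice the reply into descriptors at whatever granularity it
# likes (e.g. 4k for holes, 64k for dirtiness), provided the descriptor
# lengths add up to the length of the original client request.
def descriptors_cover_request(descriptors, request_length):
    """descriptors: list of (length, status_flags) pairs."""
    return sum(length for length, _flags in descriptors) == request_length

# A 128 KiB request answered with one 64 KiB dirty extent plus
# sixteen 4 KiB hole extents: mixed granularities, but a valid reply.
mixed_granularity_reply = [(64 * 1024, 0x1)] + [(4 * 1024, 0x2)] * 16
```

This is the property Alex's objection turns on: one command must satisfy this check across both granularities at once, whereas two commands could each reply at their own natural block size.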

> At this point, I was just trying to rebase the proposal as originally
> made by Denis and Pavel; perhaps they will have more insight on how they
> envisioned using the command, or on whether we should try harder to make
> this more of a two-way protocol (where the client can tell the server
> when to mark something as clean, or when to start tracking whether
> something is dirty).

I'm not sure it needs to be a two way protocol (see my strawman) but
I'd like to know it's at least possibly useful.

--
Alex Bligh








Re: [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-04-04 Thread Denis V. Lunev
On 04/04/2016 10:34 PM, Eric Blake wrote:
> On 04/04/2016 12:06 PM, Alex Bligh wrote:
>> On 4 Apr 2016, at 17:39, Eric Blake  wrote:
>>
>>> +This command is meant to operate in tandem with other (non-NBD)
>>> +channels to the server.  Generally, a "dirty" block is a block
>>> +that has been written to by someone, but the exact meaning of "has
>>> +been written" is left to the implementation.  For example, a
>>> +virtual machine monitor could provide a (non-NBD) command to start
>>> +tracking blocks written by the virtual machine.  A backup client
>>> +can then connect to an NBD server provided by the virtual machine
>>> +monitor and use `NBD_CMD_BLOCK_STATUS` with the
>>> +`NBD_FLAG_STATUS_DIRTY` bit set in order to read only the dirty
>>> +blocks that the virtual machine has changed.
>>> +
>>> +An implementation that doesn't track the "dirtiness" state of
>>> +blocks MUST either fail this command with `EINVAL`, or mark all
>>> +blocks as dirty in the descriptor that it returns.  Upon receiving
>>> +an `NBD_CMD_BLOCK_STATUS` command with the flag
>>> +`NBD_FLAG_STATUS_DIRTY` set, the server MUST return the dirtiness
>>> +status of the device, where the status field of each descriptor is
>>> +determined by the following bit:
>>> +
>>> +  - `NBD_STATE_CLEAN` (bit 2); if set, the block represents a
>>> +portion of the file that is still clean because it has not
>>> +been written; if clear, the block represents a portion of the
>>> +file that is dirty, or where the server could not otherwise
>>> +determine its status.
>> A couple of questions:
>>
>> 1. I am not sure that the block dirtiness and the zero/allocation/hole thing
>> always have the same natural blocksize. It's pretty easy to imagine
>> a server whose natural blocksize is a disk sector (and can therefore
>> report presence of zeroes to that resolution) but where 'dirtiness'
>> was maintained independently at a less fine-grained level. Maybe
>> that suggests 2 commands would be useful.
> In fact, qemu does just that with qcow2 images - the user can request a
> dirtiness granularity that is much larger than cluster granularity
> (where clusters are the current limitation on reporting holes, but where
> Kevin Wolf has an idea about a potential qcow2 extension that would even
> let us report holes at a sector granularity).
>
> Nothing requires the two uses to report at the same granularity.  The
> NBD_REPLY_TYPE_BLOCK_STATUS allows the server to divide into descriptors
> as it sees fit (so it could report holes at a 4k granularity, but
> dirtiness only at a 64k granularity) - all that matters is that when all
> the descriptors have been sent, they total up to the length of the
> original client request.  So by itself, granularity does not require
> another command.
exactly!


>> 2. Given the communication is out of band, how is it realistically
>> possible to sync this backup? You'll ask for all the dirty blocks,
>> but whilst the command is being executed (as well as immediately
>> after the reply) further blocks may be dirtied. So your reply
>> always overestimates what is clean (probably the wrong way around).
>> Furthermore, the next time you do a 'backup', you don't know whether
>> the blocks were dirty as they were dirty on the previous backup,
>> or because they were dirty on this backup.
> You are correct that as a one-way operation, querying dirtiness is not
> very useful if there is not a way to mark something clean, or if
> something else can be dirtying things in parallel.  But that doesn't
> mean the command is not useful - if the NBD server is exporting a file
> as read-only, where nothing else can be dirtying it in parallel, then a
> single pass over the dirty information is sufficient to learn what
> portions of the file to copy out.
>
> At this point, I was just trying to rebase the proposal as originally
> made by Denis and Pavel; perhaps they will have more insight on how they
> envisioned using the command, or on whether we should try harder to make
> this more of a two-way protocol (where the client can tell the server
> when to mark something as clean, or when to start tracking whether
> something is dirty).
for now and for QEMU we want this to expose accumulated dirtiness
of the block device, which is collected by the server. Yes, this requires
external coordination. Maybe this COULD be part of the protocol,
but QEMU will not use that part of the protocol.

Speaking about dirtiness, we would soon come to the fact that
we can have several dirtiness states regarding different
lines of incremental backups. This complexity is hidden
inside QEMU and it would be very difficult to publish and
reuse it.


>> If I was designing a backup protocol (off the top of my head) I'd
>> make all commands return a monotonic 64 bit counter of the number of
>> writes to the disk