Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-12-08 Thread Alex Bligh

> On 8 Dec 2016, at 06:58, Vladimir Sementsov-Ogievskiy wrote:
> 
> An idea: let's not use uppercase. Why shout the namespace? 'base' and 'x-'
> would be better, I think. BASE and X- will encourage all user-defined
> namespaces to be uppercase too, and a lot of uppercase will creep into the
> code =(

I agree

-- 
Alex Bligh







Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-12-07 Thread Vladimir Sementsov-Ogievskiy

08.12.2016 06:39, Alex Bligh wrote:

[...]


are vendor dependent extensions. The 'BASE' namespace represents
metadata contexts defined within this document. Other namespaces
may be registered by reference within this document.

+The "BASE:allocation"

Backticks: `BASE:allocation`


An idea: let's not use uppercase. Why shout the namespace? 'base' and
'x-' would be better, I think. BASE and X- will encourage all user-defined
namespaces to be uppercase too, and a lot of uppercase will creep into
the code =(



--
Best regards,
Vladimir




Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-12-07 Thread Alex Bligh

> On 2 Dec 2016, at 18:45, Alex Bligh wrote:
> 
> Thanks. That makes sense - or enough sense for me to carry on commenting!


I finally had some time to go through this extension in detail. Rather
than comment on all the individual patches, I squashed them into a single
commit, did a 'git format-patch' on it, and commented on that.

diff --git a/doc/proto.md b/doc/proto.md
index c443494..9c0981f 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -871,6 +869,50 @@ of the newstyle negotiation.

Defined by the experimental `INFO` 
[extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).

+- `NBD_OPT_META_CONTEXT` (10)
+
+Return a list of `NBD_REP_META_CONTEXT` replies, one per context,
+followed by an `NBD_REP_ACK`. If a server replies to such a request
+with no error message, clients

"*the* server" / "*the* cient"

Perhaps only an 'NBD_REP_ERR_UNSUP' error should prevent the
client from querying the server.

+MAY send NBD_CMD_BLOCK_STATUS
+commands during the transmission phase.

Add: "If a server replies to such a request with NBD_REP_ERR_UNSUP,
the client MUST NOT send NBD_CMD_BLOCK_STATUS commands during the
transmission phase."

+
+If the query string is syntactically invalid, the server SHOULD send
+`NBD_REP_ERR_INVALID`.

'MUST send', else it implies that sending nothing is permissible.

+If the query string is syntactically valid
+but finds no metadata contexts, the server MUST send a single
+reply of type `NBD_REP_ACK`.
+
+This option MUST NOT be requested unless structured replies have

Active voice better:

"The client MUST NOT send this option unless" ...

+been negotiated first. If a client attempts to do so, a server
+SHOULD send `NBD_REP_ERR_INVALID`.
+
+Data:
+- 32 bits, type
+- String, query to select a subset of the available metadata
+  contexts. If this is not specified (i.e., length is 4 and no
+  command is sent), then the server MUST send all the metadata
+  contexts it knows about. If specified, this query string MUST
+  start with a name that uniquely identifies a server
+  implementation; e.g., the reference implementation that
+  accompanies this document would support query strings starting
+  with 'nbd-server:'

Why not just define the format of a metadata-context string to be
of the form '<namespace>:<context name>' (perhaps this definition goes
elsewhere), and here just say the returned list is a left-match
of all the available metadata contexts, i.e. all those metadata
contexts whose names either start with or consist entirely of the
specified string. If an empty string is specified, all metadata
contexts are returned.
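
As an illustration, the left-match rule could be as simple as the
following sketch (the helper is hypothetical, not taken from any NBD
implementation):

    #include <stdbool.h>
    #include <string.h>

    /* Does the available metadata context 'name' left-match the client's
     * 'query'?  An empty query matches every context; otherwise the
     * context name must start with, or consist entirely of, the query. */
    static bool meta_context_left_match(const char *query, const char *name)
    {
        size_t qlen = strlen(query);
        return qlen == 0 || strncmp(name, query, qlen) == 0;
    }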

+
+The type may be one of:
+- `NBD_META_LIST_CONTEXT` (1): the list of metadata contexts
+  selected by the query string is returned to the client without
+  changing any state (i.e., this does not add metadata contexts
+  for further usage).

Somewhere it should say this list is returned by sending
zero or more NBD_REP_META_CONTEXT records followed by a NBD_REP_ACK.

+- `NBD_META_ADD_CONTEXT` (2): the list of metadata contexts
+  selected by the query string is added to the list of existing
+  metadata contexts.
+- `NBD_META_DEL_CONTEXT` (3): the list of metadata contexts
+  selected by the query string is removed from the list of used
+  metadata contexts. Servers SHOULD NOT reuse existing metadata
+  context IDs.
+
+The syntax of the query string is not specified, except that
+implementations MUST support adding and removing individual metadata
+contexts by simply listing their names.

This seems slightly overcomplicated. Rather than have the server hold
a list of active metadata contexts, we could simply have
two NBD_OPT_ commands, say NBD_OPT_LIST_META_CONTEXTS and
NBD_OPT_SET_META_CONTEXTS (which simply sets the list). Then there is no
need for 'type', or for _ADD_ and _DEL_.
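
For concreteness, a client might serialize the option data as drafted
above (32 bits of type, then the query string) roughly as follows; the
helper and its name are illustrative only, and the buffer is assumed
large enough:

    #include <stdint.h>
    #include <string.h>
    #include <arpa/inet.h>

    /* Type values from the draft text above. */
    enum {
        NBD_META_LIST_CONTEXT = 1,
        NBD_META_ADD_CONTEXT  = 2,
        NBD_META_DEL_CONTEXT  = 3,
    };

    /* Pack the NBD_OPT_META_CONTEXT data: 32-bit type (network order)
     * followed by the query string (no NUL terminator on the wire).
     * A zero-length query means "all contexts the server knows about". */
    static size_t pack_meta_context_data(uint8_t *buf, uint32_t type,
                                         const char *query)
    {
        uint32_t be_type = htonl(type);
        size_t qlen = query ? strlen(query) : 0;

        memcpy(buf, &be_type, 4);
        if (qlen)
            memcpy(buf + 4, query, qlen);
        return 4 + qlen;        /* length 4 == no query string sent */
    }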

 Option reply types

These values are used in the "reply type" field, sent by the server
@@ -882,7 +924,7 @@ during option haggling in the fixed newstyle negotiation.
information is available, or when sending data related to the option
(in the case of `NBD_OPT_LIST`) has finished. No data.

...

+- `NBD_REP_META_CONTEXT` (4)
+
+A description of a metadata context. Data:
+
+- 32 bits, NBD metadata context ID.
+- String, name of the metadata context. This is not required to be
+  a human-readable string, but it MUST be valid UTF-8 data.

I would suggest putting a length of the string before the string,
which will allow us to expand this later to add fields if necessary.
This seems to be what we are doing elsewhere.
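
Under that suggestion, one NBD_REP_META_CONTEXT payload would parse
along these lines; this is a sketch of the length-prefixed layout being
proposed, not of the draft as posted:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>
    #include <arpa/inet.h>

    /* Parse: 32-bit context ID, 32-bit name length, then the UTF-8 name
     * (not NUL-terminated on the wire).  Returns 0 on success, -1 on a
     * malformed payload; *name is malloc()ed and NUL-terminated. */
    static int parse_meta_context_reply(const uint8_t *data, size_t len,
                                        uint32_t *id, char **name)
    {
        uint32_t nlen;

        if (len < 8)
            return -1;
        memcpy(id, data, 4);
        *id = ntohl(*id);
        memcpy(&nlen, data + 4, 4);
        nlen = ntohl(nlen);
        if (len - 8 < nlen)
            return -1;
        *name = malloc(nlen + 1);
        if (!*name)
            return -1;
        memcpy(*name, data + 8, nlen);
        (*name)[nlen] = '\0';
        return 0;
    }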

+
+This specification declares one metadata context. It is called
+"BASE:allocation" and contains the basic "exists at all" context.
+
There are a number of error reply types, all of which are denoted by
having bit 31 set. All error replies MAY have some data set, in which
case that 

Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-12-05 Thread Vladimir Sementsov-Ogievskiy

02.12.2016 23:39, John Snow wrote:


On 12/02/2016 01:45 PM, Alex Bligh wrote:

John,


+Some storage formats and operations over such formats express a
+concept of data dirtiness. Whether the operation is block device
+mirroring, incremental block device backup or any other operation with
+a concept of data dirtiness, they all share a need to provide a list
+of ranges that this particular operation treats as dirty.

How can data be 'dirty' if it is static and unchangeable? (I thought)


In a simple case, live IO goes to e.g. hda.qcow2. These writes come from
the VM and cause the bitmap that QEMU manages to become dirty.

We intend to expose the ability to fleece dirty blocks via NBD. What
happens in this scenario would be that a snapshot of the data at the
time of the request is exported over NBD in a read-only manner.

In this way, the drive itself is R/W, but the "view" of it from NBD is
RO. While a hypothetical backup client is busy copying data out of this
temporary view, new writes are coming in to the drive, but are not being
exposed through the NBD export.

(This goes into QEMU-specifics, but those new writes are dirtying a
version of the bitmap not intended to be exposed via the NBD channel.
NBD gets effectively a snapshot of both the bitmap AND the data.)

Thanks. That makes sense - or enough sense for me to carry on commenting!


Whew! I'm glad.


I now think what you are talking about is backing up a *snapshot* of a disk
that's running, where the disk itself was not connected using NBD? IE it's
not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is effectively
an opaque state represented in a bitmap, which is binary metadata
at some particular level of granularity. It might as well be 'happiness'
or 'is coloured blue'. The NBD server would (normally) have no way of
manipulating this bitmap.

In previous comments, I said 'how come we can set the dirty bit through
writes but can't clear it?'. This (my statement) is now I think wrong,
as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The
state of the bitmap comes from whatever sets the bitmap, which is outside
the scope of this protocol; the protocol merely transmits it.


You know, this is a fair point. We have not (to my knowledge) yet
carefully considered the exact bitmap management scenario when NBD is
involved in retrieving dirty blocks.

Humor me for a moment while I talk about a (completely hypothetical, not
yet fully discussed) workflow for how I envision this feature.

(1) User sets up a drive in QEMU, a bitmap is initialized, an initial
backup is made, etc.

(2) As writes come in, QEMU's bitmap is dirtied.

(3) The user decides they want to root around to see what data has
changed and would like to use NBD to do so, in contrast to QEMU's own
facilities for dumping dirty blocks.

(4) A command is issued that creates a temporary, lightweight snapshot
('fleecing') and exports this snapshot over NBD. The bitmap is
associated with the NBD export at this point at NBD server startup. (For
the sake of QEMU discussion, maybe this command is "blockdev-fleece")

(5) At this moment, the snapshot is static and represents the data at
the time the NBD server was started. The bitmap is also forked and
represents only this snapshot. The live data and bitmap continue to change.

(6) Dirty blocks are queried and copied out via NBD.

(7) The user closes the NBD instance upon completion of their task,
whatever it was. (Making a new incremental backup? Just taking a peek at
some changed data? who knows.)

The point that's interesting here is what do we do with the two bitmaps
at this point? The data delta can be discarded (this was after all just
a lightweight read-only point-in-time snapshot) but the bitmap data
needs to be dealt with.

(A) In the case of "User made a new incremental backup," the bitmap that
got forked off to serve the NBD read should be discarded.

(B) In the case of "User just wanted to look around," the bitmap should
be merged back into the bitmap it was forked from.

I don't advise a hybrid, "User copied some data, but not all," where
we need to partially clear *and* merge, but conceivably this could
happen, because the things we don't want to happen always will.

At this point maybe it's becoming obvious that it would actually be very
prudent to allow the NBD client itself to inform QEMU via the NBD
protocol which extents/blocks/(etc.) it is "done" with.

Maybe it *would* actually be useful if, in allowing us to add a
"dirty" bit to the NBD specification, we let users clear those bits.

Then, whether the user was trying to do (A) or (B) or the unspeakable
amalgamation of both things, it's up to the user to clear the bits
desired, and QEMU can do the simple task of always merging the
bitmap fork upon the conclusion of the NBD fleecing exercise.
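
To make the fork/merge bookkeeping concrete, here is a toy sketch; none
of this is QEMU's actual bitmap API, and a real dirty bitmap is not a
plain fixed-size word array:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define BITMAP_WORDS 1024               /* toy fixed size */

    typedef struct {
        uint64_t bits[BITMAP_WORDS];
    } Bitmap;

    /* Fork at export time: the exported copy freezes the dirtiness seen
     * so far; the live bitmap is cleared and keeps tracking new writes. */
    static void bitmap_fork(Bitmap *live, Bitmap *exported)
    {
        memcpy(exported, live, sizeof(*exported));
        memset(live, 0, sizeof(*live));
    }

    /* Merge at teardown (case B above): OR the forked bits back into the
     * live bitmap so that no dirtiness is lost. */
    static void bitmap_merge(Bitmap *live, const Bitmap *exported)
    {
        for (size_t i = 0; i < BITMAP_WORDS; i++)
            live->bits[i] |= exported->bits[i];
    }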

Maybe this would allow the dirty bit to have a bit more concrete meaning
for the NBD spec: "The bit stays dirty until the user clears it, and is
set when the matching 

Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-12-03 Thread Alex Bligh

> On 2 Dec 2016, at 20:39, John Snow wrote:
> 
> OK. We do certainly support multiple bitmaps being active at a time in
> QEMU, but I had personally always envisioned that you'd associate them
> one-at-a-time when starting the NBD export of a particular device.
> 
> I don't have a use case in my head where two distinct bitmaps being
> exposed simultaneously offer any particular benefit, but maybe there is
> something. I'm sure there is.

The obvious case is an allocation bitmap, and a dirty bitmap.

It's possible one might want more than one dirty bitmap at once.
Perhaps two sorts of backup, or perhaps live migration of storage
backed by NBD, or perhaps inspecting the *state* of live migration
via NBD and a bitmap, or perhaps determining what extents in
a QCOW image are in that image itself (as opposed to the image
on which it is based).

I tried to pick some QEMU-like ones, but I am sure there are
examples that would work outside of QEMU.

-- 
Alex Bligh







Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-12-02 Thread John Snow


On 12/02/2016 01:45 PM, Alex Bligh wrote:
> John,
> 
>>> +Some storage formats and operations over such formats express a
>>> +concept of data dirtiness. Whether the operation is block device
>>> +mirroring, incremental block device backup or any other operation with
>>> +a concept of data dirtiness, they all share a need to provide a list
>>> +of ranges that this particular operation treats as dirty.
>>>
>>> How can data be 'dirty' if it is static and unchangeable? (I thought)
>>>
>>
>> In a simple case, live IO goes to e.g. hda.qcow2. These writes come from
>> the VM and cause the bitmap that QEMU manages to become dirty.
>>
>> We intend to expose the ability to fleece dirty blocks via NBD. What
>> happens in this scenario would be that a snapshot of the data at the
>> time of the request is exported over NBD in a read-only manner.
>>
>> In this way, the drive itself is R/W, but the "view" of it from NBD is
>> RO. While a hypothetical backup client is busy copying data out of this
>> temporary view, new writes are coming in to the drive, but are not being
>> exposed through the NBD export.
>>
>> (This goes into QEMU-specifics, but those new writes are dirtying a
>> version of the bitmap not intended to be exposed via the NBD channel.
>> NBD gets effectively a snapshot of both the bitmap AND the data.)
> 
> Thanks. That makes sense - or enough sense for me to carry on commenting!
> 

Whew! I'm glad.

>>> I now think what you are talking about is backing up a *snapshot* of a disk
>>> that's running, where the disk itself was not connected using NBD? IE it's
>>> not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is
>>> effectively an opaque state represented in a bitmap, which is binary metadata
>>> at some particular level of granularity. It might as well be 'happiness'
>>> or 'is coloured blue'. The NBD server would (normally) have no way of
>>> manipulating this bitmap.
>>>
>>> In previous comments, I said 'how come we can set the dirty bit through
>>> writes but can't clear it?'. This (my statement) is now I think wrong,
>>> as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The
>>> state of the bitmap comes from whatever sets the bitmap, which is outside
>>> the scope of this protocol; the protocol merely transmits it.
>>>
>>
>> You know, this is a fair point. We have not (to my knowledge) yet
>> carefully considered the exact bitmap management scenario when NBD is
>> involved in retrieving dirty blocks.
>>
>> Humor me for a moment while I talk about a (completely hypothetical, not
>> yet fully discussed) workflow for how I envision this feature.
>>
>> (1) User sets up a drive in QEMU, a bitmap is initialized, an initial
>> backup is made, etc.
>>
>> (2) As writes come in, QEMU's bitmap is dirtied.
>>
>> (3) The user decides they want to root around to see what data has
>> changed and would like to use NBD to do so, in contrast to QEMU's own
>> facilities for dumping dirty blocks.
>>
>> (4) A command is issued that creates a temporary, lightweight snapshot
>> ('fleecing') and exports this snapshot over NBD. The bitmap is
>> associated with the NBD export at this point at NBD server startup. (For
>> the sake of QEMU discussion, maybe this command is "blockdev-fleece")
>>
>> (5) At this moment, the snapshot is static and represents the data at
>> the time the NBD server was started. The bitmap is also forked and
>> represents only this snapshot. The live data and bitmap continue to change.
>>
>> (6) Dirty blocks are queried and copied out via NBD.
>>
>> (7) The user closes the NBD instance upon completion of their task,
>> whatever it was. (Making a new incremental backup? Just taking a peek at
>> some changed data? who knows.)
>>
>> The point that's interesting here is what do we do with the two bitmaps
>> at this point? The data delta can be discarded (this was after all just
>> a lightweight read-only point-in-time snapshot) but the bitmap data
>> needs to be dealt with.
>>
>> (A) In the case of "User made a new incremental backup," the bitmap that
>> got forked off to serve the NBD read should be discarded.
>>
>> (B) In the case of "User just wanted to look around," the bitmap should
>> be merged back into the bitmap it was forked from.
>>
>> I don't advise a hybrid, "User copied some data, but not all," where
>> we need to partially clear *and* merge, but conceivably this could
>> happen, because the things we don't want to happen always will.
>>
>> At this point maybe it's becoming obvious that it would actually be very
>> prudent to allow the NBD client itself to inform QEMU via the NBD
>> protocol which extents/blocks/(etc.) it is "done" with.
>>
>> Maybe it *would* actually be useful if, in allowing us to add a
>> "dirty" bit to the NBD specification, we let users clear those bits.
>>
>> Then, whether the user was trying to do (A) or (B) or the unspeakable
>> amalgamation of both things, it's up to the user to clear the bits
>> desired and QEMU can do the simple task of simply 

Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-12-02 Thread Alex Bligh
John,

>> +Some storage formats and operations over such formats express a
>> +concept of data dirtiness. Whether the operation is block device
>> +mirroring, incremental block device backup or any other operation with
>> +a concept of data dirtiness, they all share a need to provide a list
>> +of ranges that this particular operation treats as dirty.
>> 
>> How can data be 'dirty' if it is static and unchangeable? (I thought)
>> 
> 
> In a simple case, live IO goes to e.g. hda.qcow2. These writes come from
> the VM and cause the bitmap that QEMU manages to become dirty.
> 
> We intend to expose the ability to fleece dirty blocks via NBD. What
> happens in this scenario would be that a snapshot of the data at the
> time of the request is exported over NBD in a read-only manner.
> 
> In this way, the drive itself is R/W, but the "view" of it from NBD is
> RO. While a hypothetical backup client is busy copying data out of this
> temporary view, new writes are coming in to the drive, but are not being
> exposed through the NBD export.
> 
> (This goes into QEMU-specifics, but those new writes are dirtying a
> version of the bitmap not intended to be exposed via the NBD channel.
> NBD gets effectively a snapshot of both the bitmap AND the data.)

Thanks. That makes sense - or enough sense for me to carry on commenting!

>> I now think what you are talking about is backing up a *snapshot* of a disk
>> that's running, where the disk itself was not connected using NBD? IE it's
>> not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is
>> effectively an opaque state represented in a bitmap, which is binary metadata
>> at some particular level of granularity. It might as well be 'happiness'
>> or 'is coloured blue'. The NBD server would (normally) have no way of
>> manipulating this bitmap.
>>
>> In previous comments, I said 'how come we can set the dirty bit through
>> writes but can't clear it?'. This (my statement) is now I think wrong,
>> as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The
>> state of the bitmap comes from whatever sets the bitmap, which is outside
>> the scope of this protocol; the protocol merely transmits it.
>> 
> 
> You know, this is a fair point. We have not (to my knowledge) yet
> carefully considered the exact bitmap management scenario when NBD is
> involved in retrieving dirty blocks.
> 
> Humor me for a moment while I talk about a (completely hypothetical, not
> yet fully discussed) workflow for how I envision this feature.
> 
> (1) User sets up a drive in QEMU, a bitmap is initialized, an initial
> backup is made, etc.
> 
> (2) As writes come in, QEMU's bitmap is dirtied.
> 
> (3) The user decides they want to root around to see what data has
> changed and would like to use NBD to do so, in contrast to QEMU's own
> facilities for dumping dirty blocks.
> 
> (4) A command is issued that creates a temporary, lightweight snapshot
> ('fleecing') and exports this snapshot over NBD. The bitmap is
> associated with the NBD export at this point at NBD server startup. (For
> the sake of QEMU discussion, maybe this command is "blockdev-fleece")
> 
> (5) At this moment, the snapshot is static and represents the data at
> the time the NBD server was started. The bitmap is also forked and
> represents only this snapshot. The live data and bitmap continue to change.
> 
> (6) Dirty blocks are queried and copied out via NBD.
> 
> (7) The user closes the NBD instance upon completion of their task,
> whatever it was. (Making a new incremental backup? Just taking a peek at
> some changed data? who knows.)
> 
> The point that's interesting here is what do we do with the two bitmaps
> at this point? The data delta can be discarded (this was after all just
> a lightweight read-only point-in-time snapshot) but the bitmap data
> needs to be dealt with.
> 
> (A) In the case of "User made a new incremental backup," the bitmap that
> got forked off to serve the NBD read should be discarded.
> 
> (B) In the case of "User just wanted to look around," the bitmap should
> be merged back into the bitmap it was forked from.
> 
> I don't advise a hybrid, "User copied some data, but not all," where
> we need to partially clear *and* merge, but conceivably this could
> happen, because the things we don't want to happen always will.
> 
> At this point maybe it's becoming obvious that it would actually be very
> prudent to allow the NBD client itself to inform QEMU via the NBD
> protocol which extents/blocks/(etc.) it is "done" with.
> 
> Maybe it *would* actually be useful if, in allowing us to add a
> "dirty" bit to the NBD specification, we let users clear those bits.
> 
> Then, whether the user was trying to do (A) or (B) or the unspeakable
> amalgamation of both things, it's up to the user to clear the bits
> desired, and QEMU can do the simple task of always merging the
> bitmap fork upon the conclusion of the NBD fleecing exercise.
> 
> Maybe this would allow the dirty bit to have a bit more 

Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-12-02 Thread Vladimir Sementsov-Ogievskiy

02.12.2016 02:42, John Snow wrote:

(B) In the case of "User just wanted to look around," the bitmap should
be merged back into the bitmap it was forked from.

currently existing example: "failed incremental backup"


--
Best regards,
Vladimir




Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-12-01 Thread John Snow
Hi Alex, let me try my hand at clarifying some points...

On 11/29/2016 07:57 AM, Alex Bligh wrote:
> Vladimir,
> 
> I went back to April to reread the previous train of conversation,
> then found you had helpfully summarised some of it. Comments
> below.
> 
> Rather than comment on many of the points individually, the root
> of my confusion, and to some extent discomfort, about this
> proposal is 'who owns the meaning of the bitmaps'.
> 
> Some of this is my own confusion (sorry) about the use to which
> this is being put, which is I think at root a documentation issue.
> To illustrate this, you write in the FAQ section that this is for
> read-only disks, but the text talks about:
> 

I sent an earlier email in response to Wouter over our exact "goal" with
this spec extension that is a little more of an overview, but I will try
to address your questions specifically.

> +Some storage formats and operations over such formats express a
> +concept of data dirtiness. Whether the operation is block device
> +mirroring, incremental block device backup or any other operation with
> +a concept of data dirtiness, they all share a need to provide a list
> +of ranges that this particular operation treats as dirty.
> 
> How can data be 'dirty' if it is static and unchangeable? (I thought)
> 

In a simple case, live IO goes to e.g. hda.qcow2. These writes come from
the VM and cause the bitmap that QEMU manages to become dirty.

We intend to expose the ability to fleece dirty blocks via NBD. What
happens in this scenario would be that a snapshot of the data at the
time of the request is exported over NBD in a read-only manner.

In this way, the drive itself is R/W, but the "view" of it from NBD is
RO. While a hypothetical backup client is busy copying data out of this
temporary view, new writes are coming in to the drive, but are not being
exposed through the NBD export.

(This goes into QEMU-specifics, but those new writes are dirtying a
version of the bitmap not intended to be exposed via the NBD channel.
NBD gets effectively a snapshot of both the bitmap AND the data.)

> I now think what you are talking about is backing up a *snapshot* of a disk
> that's running, where the disk itself was not connected using NBD? IE it's
> not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is effectively
> an opaque state represented in a bitmap, which is binary metadata
> at some particular level of granularity. It might as well be 'happiness'
> or 'is coloured blue'. The NBD server would (normally) have no way of
> manipulating this bitmap.
> 
> In previous comments, I said 'how come we can set the dirty bit through
> writes but can't clear it?'. This (my statement) is now I think wrong,
> as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The
> state of the bitmap comes from whatever sets the bitmap, which is outside
> the scope of this protocol; the protocol merely transmits it.
> 

You know, this is a fair point. We have not (to my knowledge) yet
carefully considered the exact bitmap management scenario when NBD is
involved in retrieving dirty blocks.

Humor me for a moment while I talk about a (completely hypothetical, not
yet fully discussed) workflow for how I envision this feature.

(1) User sets up a drive in QEMU, a bitmap is initialized, an initial
backup is made, etc.

(2) As writes come in, QEMU's bitmap is dirtied.

(3) The user decides they want to root around to see what data has
changed and would like to use NBD to do so, in contrast to QEMU's own
facilities for dumping dirty blocks.

(4) A command is issued that creates a temporary, lightweight snapshot
('fleecing') and exports this snapshot over NBD. The bitmap is
associated with the NBD export at this point at NBD server startup. (For
the sake of QEMU discussion, maybe this command is "blockdev-fleece")

(5) At this moment, the snapshot is static and represents the data at
the time the NBD server was started. The bitmap is also forked and
represents only this snapshot. The live data and bitmap continue to change.

(6) Dirty blocks are queried and copied out via NBD.

(7) The user closes the NBD instance upon completion of their task,
whatever it was. (Making a new incremental backup? Just taking a peek at
some changed data? who knows.)

The point that's interesting here is what do we do with the two bitmaps
at this point? The data delta can be discarded (this was after all just
a lightweight read-only point-in-time snapshot) but the bitmap data
needs to be dealt with.

(A) In the case of "User made a new incremental backup," the bitmap that
got forked off to serve the NBD read should be discarded.

(B) In the case of "User just wanted to look around," the bitmap should
be merged back into the bitmap it was forked from.

I don't advise a hybrid, "User copied some data, but not all," where
we need to partially clear *and* merge, but conceivably this could
happen, because the things we don't want to happen always will.

At this point maybe it's becoming 

Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-29 Thread Vladimir Sementsov-Ogievskiy

29.11.2016 17:52, Alex Bligh wrote:

Vladimir,


4. Q: Should not get_{allocated,dirty} be separate commands?
   cons: Two commands with almost the same semantics and similar means?
   pros: However, here is a good point in favour of separating the clearly
 defined, native-to-block-devices GET_BLOCK_STATUS from user-driven and
 actually undefined data, called 'dirtiness'.

I'm suggesting one generic 'read bitmap' command like you.

To support get_block_status in this general read_bitmap, we will need to define 
something like 'multibitmap', which allows several bits per chunk, as 
allocation data has two: zero and allocated.

I think you are saying that for an arbitrary 'bitmap' there might be more than one 
state. For instance, one might (in an allocation 'bitmap') have a hole, a 
non-hole-zero, or a non-hole-non-zero.

In the spec I'd suggest, for one 'bitmap', we represent the output as extents. 
Each extent has a status. For the bitmap to be useful, at least two statuses need 
to be possible, but the above would have three. This could be internally 
implemented by the server as (a) a bitmap (with two bits per entry), (b) two 
bitmaps (possibly with different granularity), (c) something else (e.g. reading 
file extents, then if the data is allocated manually comparing it against zero).

I should have put 'bitmap' in quotes in what I wrote because returning extents 
(as you suggested) is a good idea, and there need not be an actual bitmap.
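
A sketch of that extent-based representation, with the three allocation
states named above (identifiers are illustrative, not from the spec):

    #include <stdint.h>

    /* Three possible statuses of an allocation 'bitmap'. */
    enum alloc_status {
        ALLOC_HOLE,     /* a hole: not allocated                     */
        ALLOC_ZERO,     /* allocated, reads as zeroes                */
        ALLOC_DATA,     /* allocated, may contain non-zero data      */
    };

    /* One extent of a reply: a run of bytes sharing a single status.
     * How the server computes the status (one bitmap with two bits per
     * entry, two bitmaps, or file extents plus comparison against zero)
     * is invisible to the client. */
    struct extent {
        uint64_t length;    /* bytes covered by this extent */
        uint32_t status;    /* e.g. one of alloc_status     */
    };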


5. Number of status descriptors, sent by server, should be restricted
   variants:
   1: just allow server to restrict this as it wants (which was done in v3)
   2: (not excluding 1). Client specifies somehow the maximum for number
  of descriptors.
  2.1: add command flag, which will request only one descriptor
   (otherwise, no restrictions from the client)
  2.2: again, introduce extended nbd requests, and add field to
   specify this maximum

I think some form of extended request is the way to go, but out of
interest, what's the issue with as many descriptors being sent as it
takes to encode the reply? The client can just consume the remainder
(without buffering) and reissue the request at a later point for
the areas it discarded.

the issue is that too many descriptors are possible. So, (1) solves it. (2) is 
optional, just to simplify/optimize the client side.

I think I'd prefer the server to return what it was asked for, and the client 
to deal with it. So either the client should be able to specify a maximum 
number of extents (and if we are extending the command structure, that's 
possible) or we deal with the client consuming and retrying unwanted extents. 
The reason for this is that it's unlikely the server can know a priori the 
number of extents which is the appropriate maximum for the client.


+The list of block status descriptors within the
+`NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
+of the file starting from specified *offset*, and the sum of the
+*length* fields of each descriptor MUST not be greater than the
+overall *length* of the request. This means that the server MAY
+return less data than required. However the server MUST return at
+least one status descriptor

I'm not sure I understand why that's useful. What should the client
infer from the server refusing to provide information? We don't
permit short reads etc.

if the bitmap is 010101010101 we will have too many descriptors. For example, a 16 TB 
disk at 64 KiB granularity -> 2 GiB of descriptor payload.
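
Spelling that arithmetic out (binary units, and the 8-byte descriptor of
the draft: a 32-bit length plus a 32-bit status):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t disk = 16ULL << 40;        /* 16 TiB             */
        uint64_t gran = 64ULL << 10;        /* 64 KiB granularity */
        /* Worst case (010101...): one descriptor per granule. */
        uint64_t extents = disk / gran;     /* 2^28 = 268435456   */
        printf("%llu extents -> %llu bytes of descriptors\n",
               (unsigned long long)extents,
               (unsigned long long)(extents * 8));  /* 2 GiB */
        return 0;
    }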

Yep. And the cost of consuming and retrying is quite high. One option would be 
for the client to realise this is a possibility, and not request the entire 
extent map for a 16TB disk, as it might be very large! Even if the client 
worked at e.g. a 64MB level (where they'd get a maximum of 1024 extents per 
reply), this isn't going to noticeably increase the round trip timing. One 
issue here is that to determine a 'reasonable' size, the client needs to know 
the minimum length of any extent.


and with this approach we will in turn have overhead of too many 
requests for  or  bitmaps.




I think the answer is probably a 'maximum number of extents' in the request 
packet.

Of course with statuses in extents, the final extent could be represented as 'I 
don't know, break this bit into a separate request' status.
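
If the command structure were extended along those lines, the request
might carry the client's limit roughly as follows (a sketch of the
suggestion, not an agreed wire format):

    #include <stdint.h>

    /* Hypothetical extended NBD_CMD_BLOCK_STATUS request: the usual
     * offset/length plus the client's cap on returned descriptors.
     * Field names and layout are illustrative only. */
    struct nbd_block_status_request_ext {
        uint64_t offset;        /* start of the queried range        */
        uint32_t length;        /* length of the queried range       */
        uint32_t max_extents;   /* cap on descriptors; 0 = unlimited */
    };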



With such a predefined status, we can postpone creating extended requests, 
have the number of extents restricted by the server, and have the sum of 
extent lengths be equal to the request length.



--
Best regards,
Vladimir




Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-29 Thread Alex Bligh
Vladimir,

>>> 4. Q: Should not get_{allocated,dirty} be separate commands?
>>>   cons: Two commands with almost the same semantics and similar means?
>>>   pros: However, here is a good point in favour of separating the clearly
>>> defined, native-to-block-devices GET_BLOCK_STATUS from user-driven and
>>> actually undefined data, called 'dirtiness'.
>> I'm suggesting one generic 'read bitmap' command like you.
> 
> To support get_block_status in this general read_bitmap, we will need to 
> define something like 'multibitmap', which allows several bits per chunk, as 
> allocation data has two: zero and allocated.

I think you are saying that for an arbitrary 'bitmap' there might be more than one 
state. For instance, one might (in an allocation 'bitmap') have a hole, a 
non-hole-zero, or a non-hole-non-zero.

In the spec I'd suggest, for one 'bitmap', we represent the output as extents. 
Each extent has a status. For the bitmap to be useful, at least two statuses need 
to be possible, but the above would have three. This could be internally 
implemented by the server as (a) a bitmap (with two bits per entry), (b) two 
bitmaps (possibly with different granularity), (c) something else (e.g. reading 
file extents, then if the data is allocated manually comparing it against zero).

I should have put 'bitmap' in quotes in what I wrote because returning extents 
(as you suggested) is a good idea, and there need not be an actual bitmap.

>>> 5. Number of status descriptors, sent by server, should be restricted
>>>   variants:
>>>   1: just allow server to restrict this as it wants (which was done in v3)
>>>   2: (not excluding 1). Client specifies somehow the maximum for number
>>>  of descriptors.
>>>  2.1: add command flag, which will request only one descriptor
>>>   (otherwise, no restrictions from the client)
>>>  2.2: again, introduce extended nbd requests, and add field to
>>>   specify this maximum
>> I think some form of extended request is the way to go, but out of
>> interest, what's the issue with as many descriptors being sent as it
>> takes to encode the reply? The client can just consume the remainder
>> (without buffering) and reissue the request at a later point for
>> the areas it discarded.
> 
> the issue is that too many descriptors are possible. So, (1) solves it. (2) is 
> optional, just to simplify/optimize the client side.

I think I'd prefer the server to return what it was asked for, and the client 
to deal with it. So either the client should be able to specify a maximum 
number of extents (and if we are extending the command structure, that's 
possible) or we deal with the client consuming and retrying unwanted extents. 
The reason for this is that it's unlikely the server can know a priori the 
number of extents which is the appropriate maximum for the client.

>>> +The list of block status descriptors within the
>>> +`NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
>>> +of the file starting from specified *offset*, and the sum of the
>>> +*length* fields of each descriptor MUST not be greater than the
>>> +overall *length* of the request. This means that the server MAY
>>> +return less data than required. However the server MUST return at
>>> +least one status descriptor
>> I'm not sure I understand why that's useful. What should the client
>> infer from the server refusing to provide information? We don't
>> permit short reads etc.
> 
> if the bitmap is 010101010101 we will have too many descriptors. For example, 
> a 16 TB disk at 64 KiB granularity -> 2 GiB of descriptor payload.

Yep. And the cost of consuming and retrying is quite high. One option would be 
for the client to realise this is a possibility, and not request the entire 
extent map for a 16TB disk, as it might be very large! Even if the client 
worked at e.g. a 64MB level (where they'd get a maximum of 1024 extents per 
reply), this isn't going to noticeably increase the round trip timing. One 
issue here is that to determine a 'reasonable' size, the client needs to know 
the minimum length of any extent.

I think the answer is probably a 'maximum number of extents' in the request 
packet.

Of course with statuses in extents, the final extent could be represented as 'I 
don't know, break this bit into a separate request' status.

-- 
Alex Bligh







Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-29 Thread Vladimir Sementsov-Ogievskiy

29.11.2016 15:57, Alex Bligh wrote:

Vladimir,

I went back to April to reread the previous train of conversation,
then found you had helpfully summarised some of it. Comments
below.

Rather than comment on many of the points individually, the root
of my confusion, and to some extent discomfort, about this
proposal is 'who owns the meaning of the bitmaps'.

Some of this is my own confusion (sorry) about the use to which
this is being put, which is I think at root a documentation issue.
To illustrate this, you write in the FAQ section that this is for
read-only disks, but the text talks about:

+Some storage formats and operations over such formats express a
+concept of data dirtiness. Whether the operation is block device
+mirroring, incremental block device backup or any other operation with
+a concept of data dirtiness, they all share a need to provide a list
+of ranges that this particular operation treats as dirty.

How can data be 'dirty' if it is static and unchangeable? (I thought)

I now think what you are talking about is backing up a *snapshot* of a disk
that's running, where the disk itself was not connected using NBD? IE it's
not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is effectively
an opaque state represented in a bitmap, which is binary metadata
at some particular level of granularity. It might as well be 'happiness'
or 'is coloured blue'. The NBD server would (normally) have no way of
manipulating this bitmap.


Yes, something like this.

Note: in QEMU it may not be a snapshot (actually, I didn't see a way in
QEMU to export snapshots without switching to them, except opening an
external snapshot as a separate block device), but just any read-only
drive, or temporary drive, used in the image fleecing scheme:


driveA is a normal online drive
driveB is an empty NBD export

- start a backup driveA->driveB with sync=none, which means that the only
copying done is of old data from A to B before every write to A
- and set driveA as backing for driveB (on a read from B, if data is not
allocated there, it will be read from A)


after that, driveB is something like a snapshot for backup through NBD.

It's all just to say: calling the backup point-in-time state just a
'snapshot' confuses me, and I hope not to see this word in the spec (in
this context).




In previous comments, I said 'how come we can set the dirty bit through
writes but can't clear it?'. This (my statement) is now I think wrong,
as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The
state of the bitmap comes from whatever sets the bitmap, which is outside
the scope of this protocol; the protocol merely transmits it.

However, we have the uncomfortable (to me) situation where the protocol
describes a flag 'dirty', with implications as to what it does, but
no actual strict definition of how it's set. So any 'other' user has
no real idea how to use the information, or how to implement a server
that provides a 'dirty' bit, because the semantics of that aren't within
the protocol. This does not sit happily with me.

So I'm wondering whether we should simplify and generalise this spec. You
say that for the dirty flag, there's no specification of when it is
set and cleared - that's implementation defined. Would it not be better
then to say 'that whole thing is private to Qemu - even the name'.

Rather you could read the list of bitmaps a server has, with a textual
name, each having an index (and perhaps a granularity). You could then
ask on NBD_CMD_BLOCK_STATUS for the appropriate index, and get back that
bitmap value. Some names (e.g. 'ALLOCATED') could be defined in the spec,
and some (e.g. ones beginning with 'X-') could be reserved for user
usage. So you could use 'X-QEMU-DIRTY'. If you then change what your
dirty bit means, you could use 'X-QEMU-DIRTY2' or similar, and not
need a protocol change to support it.

IE rather than looking at 'a way of reading the dirty bit', we could
have this as a generic way of reading opaque bitmaps. Only one (allocation)
might be given meaning to start off with, and it wouldn't be necessary
for all servers to support that - i.e. you could support bitmap reading
without having an ALLOCATION bitmap available.

This spec would then be limited to the transmission of the bitmaps
(remove the word 'dirty' entirely, except perhaps as an illustrative
use case), and include only the definition of the allocation bitmap.
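
Concretely, the server might advertise something like the following per
bitmap (a sketch of the generalised scheme; the struct and its fields
are hypothetical):

    #include <stdint.h>

    /* One advertised bitmap: an index the client passes to
     * NBD_CMD_BLOCK_STATUS, an optional granularity, and a textual name.
     * 'ALLOCATED' would be spec-defined; 'X-' names are user-reserved. */
    struct nbd_bitmap_desc {
        uint32_t index;         /* selector for NBD_CMD_BLOCK_STATUS */
        uint32_t granularity;   /* bytes per bit, if advertised      */
        char     name[256];     /* e.g. "ALLOCATED", "X-QEMU-DIRTY"  */
    };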


Good point. For qcow2 we have finally come to just "bitmaps", not "dirty 
bitmaps", to make it more general. There is a problem with allocation if 
we want to make it a subcase of bitmap: allocation natively has two 
bits per block: zero and allocated. We can of course separate this into 
two bitmaps, but this will not be similar to the classic get_block_status.



Some more nits:


Also, the bit for NBD_FLAG_SEND_BLOCK_STATUS is changed to 9, as 8 is now
NBD_FLAG_CAN_MULTI_CONN in the master branch.

And, finally, I've rebased this onto the current state of the
extension-structured-reply branch (which itself should be 

Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-29 Thread Alex Bligh
Vladimir,

I went back to April to reread the previous train of conversation,
then found you had helpfully summarised some of it. Comments
below.

Rather than comment on many of the points individually, the root
of my confusion, and to some extent discomfort, about this
proposal is 'who owns the meaning of the bitmaps'.

Some of this is my own confusion (sorry) about the use to which
this is being put, which is I think at root a documentation issue.
To illustrate this, you write in the FAQ section that this is for
read-only disks, but the text talks about:

+Some storage formats and operations over such formats express a
+concept of data dirtiness. Whether the operation is block device
+mirroring, incremental block device backup or any other operation with
+a concept of data dirtiness, they all share a need to provide a list
+of ranges that this particular operation treats as dirty.

How can data be 'dirty' if it is static and unchangeable? (I thought)

I now think what you are talking about is backing up a *snapshot* of a disk
that's running, where the disk itself was not connected using NBD? IE it's
not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is effectively
an opaque state represented in a bitmap, which is binary metadata
at some particular level of granularity. It might as well be 'happiness'
or 'is coloured blue'. The NBD server would (normally) have no way of
manipulating this bitmap.

In previous comments, I said 'how come we can set the dirty bit through
writes but can't clear it?'. This (my statement) is now I think wrong,
as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The
state of the bitmap comes from whatever sets the bitmap, which is outside
the scope of this protocol; the protocol merely transmits it.

However, we have the uncomfortable (to me) situation where the protocol
describes a flag 'dirty', with implications as to what it does, but
no actual strict definition of how it's set. So any 'other' user has
no real idea how to use the information, or how to implement a server
that provides a 'dirty' bit, because the semantics of that aren't within
the protocol. This does not sit happily with me.

So I'm wondering whether we should simplify and generalise this spec. You
say that for the dirty flag, there's no specification of when it is
set and cleared - that's implementation defined. Would it not be better
then to say 'that whole thing is private to Qemu - even the name'.

Rather you could read the list of bitmaps a server has, with a textual
name, each having an index (and perhaps a granularity). You could then
ask on NBD_CMD_BLOCK_STATUS for the appropriate index, and get back that
bitmap value. Some names (e.g. 'ALLOCATED') could be defined in the spec,
and some (e.g. ones beginning with 'X-') could be reserved for user
usage. So you could use 'X-QEMU-DIRTY'. If you then change what your
dirty bit means, you could use 'X-QEMU-DIRTY2' or similar, and not
need a protocol change to support it.

IE rather than looking at 'a way of reading the dirty bit', we could
have this as a generic way of reading opaque bitmaps. Only one (allocation)
might be given meaning to start off with, and it wouldn't be necessary
for all servers to support that - i.e. you could support bitmap reading
without having an ALLOCATION bitmap available.

This spec would then be limited to the transmission of the bitmaps
(remove the word 'dirty' entirely, except perhaps as an illustrative
use case), and include only the definition of the allocation bitmap.

Some more nits:

> Also, the bit for NBD_FLAG_SEND_BLOCK_STATUS is changed to 9, as 8 is now
> NBD_FLAG_CAN_MULTI_CONN in the master branch.
> 
> And, finally, I've rebased this onto the current state of the
> extension-structured-reply branch (which itself should be rebased on
> master IMHO).

Each documentation branch should normally be branched off master unless
it depends on another extension (in which case it will be branched from that).
I haven't been rebasing them frequently as it can disrupt those working
on the branches. There's only really an issue around rebasing where you
depend on another branch.


> 2. Q: different granularities of dirty/allocated bitmaps. Any problems?
>   A: 1: server replies with status descriptors of any size, granularity
> is hidden from the client
>  2: dirty/allocated requests are separate and unrelated to each
> other, so their granularities are not intersecting

I'm OK with this, but note that you do actually mention a granularity
of sorts in the spec (512 bytes) - I think you should replace that
with the minimum block size.

> 3. Q: selecting of dirty bitmap to export
>   A: several variants:
>  1: id of bitmap is in flags field of request
>  pros: - simple
>  cons: - it's a hack. flags field is for other uses.
>- we'll have to map bitmap names to these "ids"
>  2: introduce extended nbd requests with variable length and exploit this
> feature for BLOCK_STATUS command, 

Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-25 Thread Stefan Hajnoczi
On Fri, Nov 25, 2016 at 02:28:16PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> With the availability of sparse storage formats, it is often necessary
> to query the status of a particular range and read only those blocks of
> data that are actually present on the block device.
> 
> To provide such information, the patch adds the BLOCK_STATUS
> extension with one new NBD_CMD_BLOCK_STATUS command, a new
> structured reply chunk format, and a new transmission flag.
> 
> There exists a concept of data dirtiness, which is required
> during, for example, incremental block device backup. To express
> this concept via NBD protocol, this patch also adds a flag to
> NBD_CMD_BLOCK_STATUS to request dirtiness information rather than
> provisioning information; however, with the current proposal, data
> dirtiness is only useful with additional coordination outside of
> the NBD protocol (such as a way to start and stop the server from
> tracking dirty sectors).  Future NBD extensions may add commands
> to control dirtiness through NBD.
> 
> Since the NBD protocol has no notion of block size, and to mimic the SCSI
> "GET LBA STATUS" command more closely, it has been chosen to return
> a list of extents in the response to the NBD_CMD_BLOCK_STATUS command,
> instead of a bitmap.
> 
> CC: Pavel Borzenkov 
> CC: Denis V. Lunev 
> CC: Wouter Verhelst 
> CC: Paolo Bonzini 
> CC: Kevin Wolf 
> CC: Stefan Hajnoczi 
> Signed-off-by: Eric Blake 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
> 
> v3:
> 
> Hi all. This is almost a resend of v2 (by Eric Blake). The only change is
> removing the restriction that the sum of status descriptor lengths must be
> equal to the requested length. I.e., let's permit the server to reply with
> less data than requested if it wants.
> 
> Also, the bit for NBD_FLAG_SEND_BLOCK_STATUS is changed to 9, as 8 is now
> NBD_FLAG_CAN_MULTI_CONN in the master branch.
> 
> And, finally, I've rebased this onto current state of
> extension-structured-reply branch (which itself should be rebased on
> master IMHO).
> 
> By this resend I just want to continue the discussion, started about half
> a year ago. Here is a summary of some questions and ideas from the v2
> discussion:
> 
> 1. Q: Synchronisation. Is such data (dirty/allocated) reliable? 
>A: This all is for read-only disks, so the data is static and unchangeable.
> 
> 2. Q: different granularities of dirty/allocated bitmaps. Any problems?
>A: 1: server replies with status descriptors of any size, granularity
>  is hidden from the client
>   2: dirty/allocated requests are separate and unrelated to each
>  other, so their granularities are not intersecting
> 
> 3. Q: selecting of dirty bitmap to export
>A: several variants:
>   1: id of bitmap is in flags field of request
>   pros: - simple
>   cons: - it's a hack. flags field is for other uses.
> - we'll have to map bitmap names to these "ids"
>   2: introduce extended nbd requests with variable length and exploit this
>  feature for BLOCK_STATUS command, specifying bitmap identifier.
>  pros: - looks like a true way
>  cons: - we have to create additional extension
>- possibly we have to create a map,
>  { bitmap name <=> id }
>   3: external tool should select which bitmap to export. So, in case of
>  QEMU it should be something like a QMP command block-export-dirty-bitmap.
>  pros: - simple
>- we can extend it to behave like (2) later
>  cons: - additional qmp command to implement (possibly, the lesser 
> evil)
>  note: Hmm, the external tool can choose between allocated/dirty data too,
>so we can remove the 'NBD_FLAG_STATUS_DIRTY' flag altogether.
> 
> 4. Q: Should not get_{allocated,dirty} be separate commands?
>cons: Two commands with almost the same semantics and similar means?
>pros: However, here is a good point in favour of separating the clearly
>  defined, native-to-block-devices GET_BLOCK_STATUS from user-driven and
>  actually undefined data, called 'dirtiness'.
> 
> 5. Number of status descriptors, sent by server, should be restricted
>variants:
>1: just allow server to restrict this as it wants (which was done in v3)
>2: (not excluding 1). Client specifies somehow the maximum for number
>   of descriptors.
>   2.1: add command flag, which will request only one descriptor
>(otherwise, no restrictions from the client)
>   2.2: again, introduce extended nbd requests, and add field to
>specify this maximum
> 
> 6. Q: What to do with unspecified flags (in request/reply)?
>I think the normal variant is to make them reserved. (The server should
>return EINVAL if it finds unknown bits; the client should consider a reply
>with unknown