Re: [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-12-02 Thread Wouter Verhelst
Hi Vladimir,

On Thu, Dec 01, 2016 at 02:26:28PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 01.12.2016 13:14, Wouter Verhelst wrote:
[...]
> > -- `NBD_ALLOC_ADD_CONTEXT` (2): the list of allocation contexts
> > +- `NBD_META_ADD_CONTEXT` (2): the list of metadata contexts
> > selected by the query string is added to the list of existing
> > -  allocation contexts.
> 
> If I understand correctly, it should not be 'existing', but 'exporting'. 
> So there are several contexts the server knows about. They definitely 
> exist. Some of them may be selected (by the client) for export (to the 
> client, through get_block_status).
> 
> So, what about 'list of metadata contexts to export', or something like this?

Yes, good idea. Thanks.

[...]
> > +The "BASE:allocation" metadata context is the basic "exists at all"
> > +metadata context. If an extent is marked with `NBD_STATE_HOLE` at that
> > +context, this means that the given extent is not allocated in the
> > +backend storage, and that writing to the extent MAY result in the ENOSPC
> > +error. This supports sparse file semantics on the server side. If a
> > +server has only one metadata context (the default), then writing to an
> > +extent which has `NBD_STATE_HOLE` clear MUST NOT fail with ENOSPC.
> 
> this dependence looks strange. It's user-defined metadata; why does it 
> affect allocation?

The reference nbd-server implementation (i.e., the one that accompanies
this document) has a "copy-on-write" mode in which modifications are
written to a separate file. This separate file can easily be a sparse
file.

Let's say you have an export which is fully allocated; the
BASE:allocation state would have STATE_HOLE cleared. However, when
writing to a particular block that has not yet been written to during
that session, the copy-on-write mode would have to allocate a new block
in the copy-on-write file, which may fail with ENOSPC.
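The failure mode described above can be sketched with a toy model (hypothetical Python, not the actual nbd-server code, which is written in C):

```python
import errno

class CowLayer:
    """Toy copy-on-write layer: the base image is fully allocated,
    but the first write to each block must allocate space in a
    separate (sparse) COW file, which can run out of room."""

    def __init__(self, free_cow_blocks):
        self.free_cow_blocks = free_cow_blocks  # space left in the COW file
        self.cow_blocks = {}                    # block number -> data

    def write_block(self, blockno, data):
        if blockno not in self.cow_blocks:
            if self.free_cow_blocks == 0:
                # The base storage is fully allocated, yet the write still
                # fails: the COW file has no space for the new block.
                raise OSError(errno.ENOSPC, "no space in copy-on-write file")
            self.free_cow_blocks -= 1
        self.cow_blocks[blockno] = data
```

Even though every block exists in the base image (so BASE:allocation would report `NBD_STATE_HOLE` clear), the first write to a block still needs a fresh allocation in the sparse COW file, which is exactly where ENOSPC can appear.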

Perhaps the above paragraph could be updated in the sense that it SHOULD
NOT fail in other cases, but that it may depend on the semantics of the
other metadata contexts, whether active (i.e., selected with
NBD_OPT_META_CONTEXT) or not.

Side note: that does mean that some metadata contexts may be specific to
particular exports (e.g., you may have a list of named snapshots to
select, and which ones would exist would depend on the specific export
chosen). I guess that means the metadata contexts should have the name
of the export, too.

> At least, 'only one' is not descriptive; it would be better to mention
> the 'BASE:allocation' name.

Yes, that would make things clearer, indeed.

> (I hope I can ask the server to export only the dirty_bitmap context, 
> not exporting allocation?

That was the point of naming that context and making it selectable :-)

> In that case this 'only one' would be dirty_bitmap)
> 
> >   For all other cases, this specification requires no specific semantics
> 
> what are 'other cases'? For all other metadata contexts? Or for all 
> cases when we have more than one context?

Other metadata contexts. I'll rephrase it in that sense.

> > -of allocation contexts. Implementations could support allocation
> > +of metadata contexts. Implementations could support metadata
> >   contexts with semantics like the following:
> >   
> > -- Incremental snapshots; if a block is allocated in one allocation
> > +- Incremental snapshots; if a block is allocated in one metadata
> > context, that implies that it is also allocated in the next level up.
> >   - Various bits of data about the backend of the storage; e.g., if the
> > -  storage is written on a RAID array, an allocation context could
> > +  storage is written on a RAID array, a metadata context could
> > return information about the redundancy level of a given extent
> >   - If the backend implements a write-through cache of some sort, or
> > -  synchronises with other servers, an allocation context could state
> > -  that an extent is "allocated" once it has reached permanent storage
> > +  synchronises with other servers, a metadata context could state
> > +  that an extent is "active" once it has reached permanent storage
> > and/or is synchronized with other servers.
> 
> Incremental snapshots sound strange to me. Snapshots are just 
> snapshots. Backup may be incremental, but it is not about snapshots. I 
> think this example may be safely deleted from the spec.

Yeah, I'll just remove all the examples; they're not really critical,
anyway, and might indeed confuse people.

[...]
> >   Servers MUST return an `NBD_REPLY_TYPE_BLOCK_STATUS` chunk for every
> > -allocation context ID, except if the semantics of particular
> > -allocation contexts mean that the information for one allocation
> > -context is implied by the information for another.
> > +metadata context ID, except if the semantics of particular
> > +metadata contexts mean that the information for one active metadata
> > +context is implied by the information for another.

Re: [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-12-01 Thread Vladimir Sementsov-Ogievskiy
01.12.2016 13:14, Wouter Verhelst wrote:
> Here's another update.
>
> Changes since previous version:
> - Rename "allocation context" to "metadata context"
> - Stop making metadata context 0 be special; instead, name it
>"BASE:allocation" and allow it to be selected like all other contexts.
> - Clarify in a bit more detail when a server MAY omit metadata
>information on a particular metadata context (i.e., only if another
>metadata context that we actually got information for implies it can't
>have meaning). This was always meant that way, but the spec could have
>been a bit more explicit about it.
> - Change one SHOULD to a MUST, where it should not have been a SHOULD
>in the first place.
>
> (This applies on top of my previous patch)
>
> Applied and visible at the usual place.
>
> diff --git a/doc/proto.md b/doc/proto.md
> index fe7ae53..9c0981f 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -869,16 +869,16 @@ of the newstyle negotiation.
>   
>   Defined by the experimental `INFO` 
> [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
>   
> -- `NBD_OPT_ALLOC_CONTEXT` (10)
> +- `NBD_OPT_META_CONTEXT` (10)
>   
> -Return a list of `NBD_REP_ALLOC_CONTEXT` replies, one per context,
> +Return a list of `NBD_REP_META_CONTEXT` replies, one per context,
>   followed by an `NBD_REP_ACK`. If a server replies to such a request
>   with no error message, clients MAY send NBD_CMD_BLOCK_STATUS
>   commands during the transmission phase.
>   
>   If the query string is syntactically invalid, the server SHOULD send
>   `NBD_REP_ERR_INVALID`. If the query string is syntactically valid
> -but finds no allocation contexts, the server MUST send a single
> +but finds no metadata contexts, the server MUST send a single
>   reply of type `NBD_REP_ACK`.
>   
>   This option MUST NOT be requested unless structured replies have
> @@ -887,9 +887,9 @@ of the newstyle negotiation.
>   
>   Data:
>   - 32 bits, type
> -- String, query to select a subset of the available allocation
> +- String, query to select a subset of the available metadata
> contexts. If this is not specified (i.e., length is 4 and no
> -  command is sent), then the server MUST send all the allocation
> +  command is sent), then the server MUST send all the metadata
> contexts it knows about. If specified, this query string MUST
> start with a name that uniquely identifies a server
> implementation; e.g., the reference implementation that
> @@ -897,18 +897,22 @@ of the newstyle negotiation.
> with 'nbd-server:'
>   
>   The type may be one of:
> -- `NBD_ALLOC_LIST_CONTEXT` (1): the list of allocation contexts
> +- `NBD_META_LIST_CONTEXT` (1): the list of metadata contexts
> selected by the query string is returned to the client without
> -  changing any state (i.e., this does not add allocation contexts
> +  changing any state (i.e., this does not add metadata contexts
> for further usage).
> -- `NBD_ALLOC_ADD_CONTEXT` (2): the list of allocation contexts
> +- `NBD_META_ADD_CONTEXT` (2): the list of metadata contexts
> selected by the query string is added to the list of existing

If I understand correctly, it should not be 'existing', but 'exporting'. 
So there are several contexts the server knows about. They definitely 
exist. Some of them may be selected (by the client) for export (to the 
client, through get_block_status).

So, what about 'list of metadata contexts to export', or something like this?

> -  allocation contexts.
> -- `NBD_ALLOC_DEL_CONTEXT` (3): the list of allocation contexts
> +  metadata contexts.
> +- `NBD_META_DEL_CONTEXT` (3): the list of metadata contexts
> selected by the query string is removed from the list of used
> -  allocation contexts. Servers SHOULD NOT reuse existing allocation
> +  metadata contexts. Servers SHOULD NOT reuse existing metadata
> context IDs.
>   
> +The syntax of the query string is not specified, except that
> +implementations MUST support adding and removing individual metadata
> +contexts by simply listing their names.
> +
>    Option reply types
>   
>   These values are used in the "reply type" field, sent by the server
> @@ -920,7 +924,7 @@ during option haggling in the fixed newstyle negotiation.
>   information is available, or when sending data related to the option
>   (in the case of `NBD_OPT_LIST`) has finished. No data.
>   
> -* `NBD_REP_SERVER` (2)
> +- `NBD_REP_SERVER` (2)
>   
>   A description of an export. Data:
>   
> @@ -935,21 +939,20 @@ during option haggling in the fixed newstyle 
> negotiation.
> particular client request, this field is defined to be a string
> suitable for direct display to a human being.
>   
> -* `NBD_REP_INFO` (3)
> +- `NBD_REP_INFO` (3)
>   
> Defined by the experimental `INFO` 
> [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).

Re: [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-12-01 Thread Wouter Verhelst
Here's another update.

Changes since previous version:
- Rename "allocation context" to "metadata context"
- Stop making metadata context 0 be special; instead, name it
  "BASE:allocation" and allow it to be selected like all other contexts.
- Clarify in a bit more detail when a server MAY omit metadata
  information on a particular metadata context (i.e., only if another
  metadata context that we actually got information for implies it can't
  have meaning). This was always meant that way, but the spec could have
  been a bit more explicit about it.
- Change one SHOULD to a MUST, where it should not have been a SHOULD
  in the first place.

(This applies on top of my previous patch)

Applied and visible at the usual place.

diff --git a/doc/proto.md b/doc/proto.md
index fe7ae53..9c0981f 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -869,16 +869,16 @@ of the newstyle negotiation.
 
 Defined by the experimental `INFO` 
[extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
 
-- `NBD_OPT_ALLOC_CONTEXT` (10)
+- `NBD_OPT_META_CONTEXT` (10)
 
-Return a list of `NBD_REP_ALLOC_CONTEXT` replies, one per context,
+Return a list of `NBD_REP_META_CONTEXT` replies, one per context,
 followed by an `NBD_REP_ACK`. If a server replies to such a request
 with no error message, clients MAY send NBD_CMD_BLOCK_STATUS
 commands during the transmission phase.
 
 If the query string is syntactically invalid, the server SHOULD send
 `NBD_REP_ERR_INVALID`. If the query string is syntactically valid
-but finds no allocation contexts, the server MUST send a single
+but finds no metadata contexts, the server MUST send a single
 reply of type `NBD_REP_ACK`.
 
 This option MUST NOT be requested unless structured replies have
@@ -887,9 +887,9 @@ of the newstyle negotiation.
 
 Data:
 - 32 bits, type
-- String, query to select a subset of the available allocation
+- String, query to select a subset of the available metadata
   contexts. If this is not specified (i.e., length is 4 and no
-  command is sent), then the server MUST send all the allocation
+  command is sent), then the server MUST send all the metadata
   contexts it knows about. If specified, this query string MUST
   start with a name that uniquely identifies a server
   implementation; e.g., the reference implementation that
@@ -897,18 +897,22 @@ of the newstyle negotiation.
   with 'nbd-server:'
 
 The type may be one of:
-- `NBD_ALLOC_LIST_CONTEXT` (1): the list of allocation contexts
+- `NBD_META_LIST_CONTEXT` (1): the list of metadata contexts
   selected by the query string is returned to the client without
-  changing any state (i.e., this does not add allocation contexts
+  changing any state (i.e., this does not add metadata contexts
   for further usage).
-- `NBD_ALLOC_ADD_CONTEXT` (2): the list of allocation contexts
+- `NBD_META_ADD_CONTEXT` (2): the list of metadata contexts
   selected by the query string is added to the list of existing
-  allocation contexts.
-- `NBD_ALLOC_DEL_CONTEXT` (3): the list of allocation contexts
+  metadata contexts.
+- `NBD_META_DEL_CONTEXT` (3): the list of metadata contexts
   selected by the query string is removed from the list of used
-  allocation contexts. Servers SHOULD NOT reuse existing allocation
+  metadata contexts. Servers SHOULD NOT reuse existing metadata
   context IDs.
 
+The syntax of the query string is not specified, except that
+implementations MUST support adding and removing individual metadata
+contexts by simply listing their names.
+
  Option reply types
 
 These values are used in the "reply type" field, sent by the server
@@ -920,7 +924,7 @@ during option haggling in the fixed newstyle negotiation.
 information is available, or when sending data related to the option
 (in the case of `NBD_OPT_LIST`) has finished. No data.
 
-* `NBD_REP_SERVER` (2)
+- `NBD_REP_SERVER` (2)
 
 A description of an export. Data:
 
@@ -935,21 +939,20 @@ during option haggling in the fixed newstyle negotiation.
   particular client request, this field is defined to be a string
   suitable for direct display to a human being.
 
-* `NBD_REP_INFO` (3)
+- `NBD_REP_INFO` (3)
 
 Defined by the experimental `INFO` 
[extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
 
-- `NBD_REP_ALLOC_CONTEXT` (4)
+- `NBD_REP_META_CONTEXT` (4)
 
-A description of an allocation context. Data:
+A description of a metadata context. Data:
 
-- 32 bits, NBD allocation context ID. If the request was NOT of type
-  `NBD_ALLOC_LIST_CONTEXT`, this field MUST NOT be zero.
-- String, name of the allocation context. This is not required to be
+- 32 bits, NBD metadata context ID.
+- String, name of the metadata context. This is not required to be
   a human-readable string.

Re: [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-30 Thread Sergey Talantov
Hi, Wouter!

> Actually, come to think of that. What is the exact use case for this thing? I 
> understand you're trying to create incremental backups of things, which would 
> imply you don't write from the client that is getting the 
> block status thingies, right?

Overall, the most desired use case for this NBD extension is to allow 
third-party software to make incremental backups.
Acronis (a vendor of backup solutions) would support qemu backup if block status 
is provided.


-Original Message-
From: Wouter Verhelst [mailto:w...@uter.be] 
Sent: Sunday, November 27, 2016 22:17
To: Vladimir Sementsov-Ogievskiy 
Cc: nbd-general@lists.sourceforge.net; kw...@redhat.com; qemu-de...@nongnu.org; 
pborzen...@virtuozzo.com; stefa...@redhat.com; pbonz...@redhat.com; 
m...@pengutronix.de; d...@openvz.org
Subject: Re: [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

Hi Vladimir,

Quickly: the reason I haven't merged this yet is twofold:
- I wasn't thrilled with the proposal at the time. It felt a bit
  hackish, and bolted onto NBD so you could use it, but without defining
  everything in the NBD protocol. "We're reading some data, but it's not
  about you". That didn't feel right.
- There were a number of questions still unanswered (you're answering a
  few below, so that's good).

For clarity, I have no objection whatsoever to adding more commands if they're 
useful, but I would prefer that they're also useful with NBD on its own, i.e., 
without requiring an initiation or correlation of some state through another 
protocol or network connection or whatever. If that's needed, that feels like I 
didn't do my job properly, if you get my point.

On Fri, Nov 25, 2016 at 02:28:16PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> With the availability of sparse storage formats, it is often needed to 
> query status of a particular range and read only those blocks of data 
> that are actually present on the block device.
> 
> To provide such information, the patch adds the BLOCK_STATUS extension 
> with one new NBD_CMD_BLOCK_STATUS command, a new structured reply 
> chunk format, and a new transmission flag.
> 
> There exists a concept of data dirtiness, which is required during, 
> for example, incremental block device backup. To express this concept 
> via NBD protocol, this patch also adds a flag to NBD_CMD_BLOCK_STATUS 
> to request dirtiness information rather than provisioning information; 
> however, with the current proposal, data dirtiness is only useful with 
> additional coordination outside of the NBD protocol (such as a way to 
> start and stop the server from tracking dirty sectors).  Future NBD 
> extensions may add commands to control dirtiness through NBD.
> 
> Since NBD protocol has no notion of block size, and to mimic SCSI "GET 
> LBA STATUS" command more closely, it has been chosen to return a list 
> of extents in the response of NBD_CMD_BLOCK_STATUS command, instead of 
> a bitmap.
> 
> CC: Pavel Borzenkov 
> CC: Denis V. Lunev 
> CC: Wouter Verhelst 
> CC: Paolo Bonzini 
> CC: Kevin Wolf 
> CC: Stefan Hajnoczi 
> Signed-off-by: Eric Blake 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
> 
> v3:
> 
> Hi all. This is almost a resend of v2 (by Eric Blake); the only change 
> is removing the restriction that the sum of status descriptor lengths 
> must be equal to the requested length. I.e., let's permit the server to 
> reply with less data than requested if it wants.

Reasonable, yes. The length that the client requests should be a maximum (i.e.
"I'm interested in this range"), not an exact request.

> Also, the bit of NBD_FLAG_SEND_BLOCK_STATUS is changed to 9, as 8 is now 
> NBD_FLAG_CAN_MULTI_CONN in the master branch.

Right.

> And, finally, I've rebased this onto current state of 
> extension-structured-reply branch (which itself should be rebased on 
> master IMHO).

Probably a good idea, given the above.

> By this resend I just want to continue the discussion, started about 
> half a year ago. Here is a summary of some questions and ideas from the v2
> discussion:
> 
> 1. Q: Synchronisation. Is such data (dirty/allocated) reliable? 
>A: This all is for read-only disks, so the data is static and unchangeable.

I think we should declare that it's up to the client to ensure no other writes 
happen without its knowledge. This may be because the client and server 
communicate out of band about state changes, or because the client somehow 
knows that it's the only writer, or whatever.

We can easily do that by declaring that the result of that command only talks 
about *current* state, and that concurrent writes by different clients may 
invalidate the state. This is true for NBD in general (i.e., concurrent r

Re: [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-29 Thread Vladimir Sementsov-Ogievskiy
29.11.2016 17:52, Alex Bligh wrote:
> Vladimir,
>
 4. Q: Should not get_{allocated,dirty} be separate commands?
cons: Two commands with almost same semantic and similar means?
pros: However here is a good point of separating clearly defined and 
 native
  for block devices GET_BLOCK_STATUS from user-driven and actually
  undefined data, called 'dirtyness'.
>>> I'm suggesting one generic 'read bitmap' command like you.
>> To support get_block_status in this general read_bitmap, we will need to 
>> define something like 'multibitmap', which allows several bits per chunk, as 
>> allocation data has two: zero and allocated.
> I think you are saying that for arbitrary 'bitmap' there might be more than 
> one state. For instance, one might (in an allocation 'bitmap') have a hole, a 
> non-hole-zero, or a non-hole-non-zero.
>
> In the spec I'd suggest, for one 'bitmap', we represent the output as 
> extents. Each extent has a status. For the bitmap to be useful, at least two 
> status need to be possible, but the above would have three. This could be 
> internally implemented by the server as (a) a bitmap (with two bits per 
> entry), (b) two bitmaps (possibly with different granularity), (c) something 
> else (e.g. reading file extents, then if the data is allocated manually 
> comparing it against zero).
>
> I should have put 'bitmap' in quotes in what I wrote because returning 
> extents (as you suggested) is a good idea, and there need not be an actual 
> bitmap.
>
 5. Number of status descriptors, sent by server, should be restricted
variants:
1: just allow server to restrict this as it wants (which was done in v3)
2: (not excluding 1). Client specifies somehow the maximum for number
   of descriptors.
   2.1: add command flag, which will request only one descriptor
(otherwise, no restrictions from the client)
   2.2: again, introduce extended nbd requests, and add field to
specify this maximum
>>> I think some form of extended request is the way to go, but out of
>>> interest, what's the issue with as many descriptors being sent as it
>>> takes to encode the reply? The client can just consume the remainder
>>> (without buffering) and reissue the request at a later point for
>>> the areas it discarded.
>> the issue is: too many descriptors possible. So, (1) solves it. (2) is 
>> optional, just to simplify/optimize client side.
> I think I'd prefer the server to return what it was asked for, and the client 
> to deal with it. So either the client should be able to specify a maximum 
> number of extents (and if we are extending the command structure, that's 
> possible) or we deal with the client consuming and retrying unwanted extents. 
> The reason for this is that it's unlikely the server can know a priori the 
> number of extents which is the appropriate maximum for the client.
>
 +The list of block status descriptors within the
 +`NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
 +of the file starting from specified *offset*, and the sum of the
 +*length* fields of each descriptor MUST not be greater than the
 +overall *length* of the request. This means that the server MAY
 +return less data than required. However the server MUST return at
 +least one status descriptor
>>> I'm not sure I understand why that's useful. What should the client
>>> infer from the server refusing to provide information? We don't
>>> permit short reads etc.
>> if the bitmap is 010101010101 we will have too many descriptors. For 
>> example, 16tb disk, 64k granularity -> 2G of descriptors payload.
> Yep. And the cost of consuming and retrying is quite high. One option would 
> be for the client to realise this is a possibility, and not request the 
> entire extent map for a 16TB disk, as it might be very large! Even if the 
> client worked at e.g. a 64MB level (where they'd get a maximum of 1024 
> extents per reply), this isn't going to noticeably increase the round trip 
> timing. One issue here is that to determine a 'reasonable' size, the client 
> needs to know the minimum length of any extent.

and with this approach we will in turn have overhead of too many 
requests for  or  bitmaps.

>
> I think the answer is probably a 'maximum number of extents' in the request 
> packet.
>
> Of course with statuses in extent, the final extent could be represented as 
> 'I don't know, break this bit into a separate request' status.
>

With such a predefined status, we can postpone creating extended requests, 
have the number of extents restricted by the server, and have the sum of 
extent lengths be equal to the request length.


-- 
Best regards,
Vladimir


--
___
Nbd-general mailing list
Nbd-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nbd-general

Re: [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-29 Thread Wouter Verhelst
On Tue, Nov 29, 2016 at 06:07:56PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 29.11.2016 17:52, Alex Bligh wrote:
> > Vladimir,
> >> if the bitmap is 010101010101 we will have too many descriptors.
> >> For example, 16tb disk, 64k granularity -> 2G of descriptors
> >> payload.
> > Yep. And the cost of consuming and retrying is quite high. One
> > option would be for the client to realise this is a possibility, and
> > not request the entire extent map for a 16TB disk, as it might be
> > very large! Even if the client worked at e.g. a 64MB level (where
> > they'd get a maximum of 1024 extents per reply), this isn't going to
> > noticeably increase the round trip timing. One issue here is that to
> > determine a 'reasonable' size, the client needs to know the minimum
> > length of any extent.
> 
> and with this approach we will in turn have overhead of too many 
> requests for  or  bitmaps.

This is why my proposal suggests the server may abort the sent extents
after X extents (sixteen, but that number can certainly change) have
been sent. After all, the server will have a better view on what's going
to be costly in terms of "number of extents".
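Illustratively, the client-side consequence can be sketched like this (plain Python standing in for the wire protocol; `query_block_status` and the 16-extent cap are assumptions for the sketch, not part of the spec):

```python
def query_block_status(server_extents, offset, length, max_extents=16):
    """Stand-in for one NBD_CMD_BLOCK_STATUS round trip: return at most
    max_extents (length, status) descriptors covering [offset, offset+length),
    mimicking a server that aborts the reply after a fixed number of extents."""
    out = []
    pos = 0
    for ext_len, status in server_extents:
        ext_start, ext_end = pos, pos + ext_len
        pos = ext_end
        if ext_end <= offset or ext_start >= offset + length:
            continue  # extent lies outside the queried range
        lo = max(ext_start, offset)
        hi = min(ext_end, offset + length)
        out.append((hi - lo, status))
        if len(out) == max_extents:
            break  # server aborts the reply here
    return out

def full_block_status(server_extents, offset, length):
    """Client loop: reissue truncated queries until the range is covered."""
    result = []
    pos = offset
    end = offset + length
    while pos < end:
        chunk = query_block_status(server_extents, pos, end - pos)
        assert chunk, "server MUST return at least one descriptor"
        for ext_len, status in chunk:
            result.append((ext_len, status))
            pos += ext_len
    return result
```

The point of the sketch: a server-side cap keeps each reply bounded, and the client recovers the full map by restarting from the last covered offset.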

> > I think the answer is probably a 'maximum number of extents' in the request 
> > packet.
> >
> > Of course with statuses in extent, the final extent could be
> > represented as 'I don't know, break this bit into a separate
> > request' status.
> >
> 
> With such predefined status, we can postpone creating extended requests, 
> have number of extents restricted by server and have sum of extents 
> lengths be equal to request length.

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
   people in the world who think they really understand all of its rules,
   and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12



Re: [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-29 Thread Alex Bligh
Vladimir,

>>> 4. Q: Should not get_{allocated,dirty} be separate commands?
>>>   cons: Two commands with almost same semantic and similar means?
>>>   pros: However here is a good point of separating clearly defined and 
>>> native
>>> for block devices GET_BLOCK_STATUS from user-driven and actually
>>> undefined data, called 'dirtyness'.
>> I'm suggesting one generic 'read bitmap' command like you.
> 
> To support get_block_status in this general read_bitmap, we will need to 
> define something like 'multibitmap', which allows several bits per chunk, as 
> allocation data has two: zero and allocated.

I think you are saying that for arbitrary 'bitmap' there might be more than one 
state. For instance, one might (in an allocation 'bitmap') have a hole, a 
non-hole-zero, or a non-hole-non-zero.

In the spec I'd suggest, for one 'bitmap', we represent the output as extents. 
Each extent has a status. For the bitmap to be useful, at least two status need 
to be possible, but the above would have three. This could be internally 
implemented by the server as (a) a bitmap (with two bits per entry), (b) two 
bitmaps (possibly with different granularity), (c) something else (e.g. reading 
file extents, then if the data is allocated manually comparing it against zero).

I should have put 'bitmap' in quotes in what I wrote because returning extents 
(as you suggested) is a good idea, and there need not be an actual bitmap.
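The 'extents rather than a bitmap' representation is easy to sketch (a toy Python model; the `(length, status)` tuples stand in for the proposed descriptors, not the actual wire format):

```python
def bitmap_to_extents(statuses):
    """Collapse a per-block status sequence into (run_length, status)
    extents, the way an NBD_CMD_BLOCK_STATUS reply would describe them.
    Any number of distinct status values works, so a two-bit 'allocation
    bitmap' (hole / zero / data) needs no special casing."""
    extents = []
    for status in statuses:
        if extents and extents[-1][1] == status:
            extents[-1] = (extents[-1][0] + 1, status)  # extend current run
        else:
            extents.append((1, status))                 # start a new run
    return extents
```

Note that the server never needs to materialise a bitmap at all; it only needs to be able to enumerate runs of equal status, whatever its internal representation is.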

>>> 5. Number of status descriptors, sent by server, should be restricted
>>>   variants:
>>>   1: just allow server to restrict this as it wants (which was done in v3)
>>>   2: (not excluding 1). Client specifies somehow the maximum for number
>>>  of descriptors.
>>>  2.1: add command flag, which will request only one descriptor
>>>   (otherwise, no restrictions from the client)
>>>  2.2: again, introduce extended nbd requests, and add field to
>>>   specify this maximum
>> I think some form of extended request is the way to go, but out of
>> interest, what's the issue with as many descriptors being sent as it
>> takes to encode the reply? The client can just consume the remainder
>> (without buffering) and reissue the request at a later point for
>> the areas it discarded.
> 
> the issue is: too many descriptors possible. So, (1) solves it. (2) is 
> optional, just to simplify/optimize client side.

I think I'd prefer the server to return what it was asked for, and the client 
to deal with it. So either the client should be able to specify a maximum 
number of extents (and if we are extending the command structure, that's 
possible) or we deal with the client consuming and retrying unwanted extents. 
The reason for this is that it's unlikely the server can know a priori the 
number of extents which is the appropriate maximum for the client.

>>> +The list of block status descriptors within the
>>> +`NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
>>> +of the file starting from specified *offset*, and the sum of the
>>> +*length* fields of each descriptor MUST not be greater than the
>>> +overall *length* of the request. This means that the server MAY
>>> +return less data than required. However the server MUST return at
>>> +least one status descriptor
>> I'm not sure I understand why that's useful. What should the client
>> infer from the server refusing to provide information? We don't
>> permit short reads etc.
> 
> if the bitmap is 010101010101 we will have too many descriptors. For example, 
> 16tb disk, 64k granularity -> 2G of descriptors payload.

Yep. And the cost of consuming and retrying is quite high. One option would be 
for the client to realise this is a possibility, and not request the entire 
extent map for a 16TB disk, as it might be very large! Even if the client 
worked at e.g. a 64MB level (where they'd get a maximum of 1024 extents per 
reply), this isn't going to noticeably increase the round trip timing. One 
issue here is that to determine a 'reasonable' size, the client needs to know 
the minimum length of any extent.
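For reference, the arithmetic behind that worst case, assuming the proposal's 8-byte descriptor (32-bit length plus 32-bit status flags):

```python
disk = 16 * 2**40          # 16 TiB disk
granularity = 64 * 2**10   # 64 KiB per bitmap entry
descriptor_size = 8        # 32-bit length + 32-bit status flags

# Worst-case 0101... bitmap: one descriptor per 64 KiB block
extents = disk // granularity          # 2**28 = 256 Mi extents
payload = extents * descriptor_size    # 2 GiB of descriptors in one reply
```

The 64 MiB per-request strategy mentioned above then caps each reply at 64 MiB / 64 KiB = 1024 extents, i.e. 8 KiB of descriptor payload per round trip.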

I think the answer is probably a 'maximum number of extents' in the request 
packet.

Of course with statuses in extent, the final extent could be represented as 'I 
don't know, break this bit into a separate request' status.

-- 
Alex Bligh







Re: [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-29 Thread Vladimir Sementsov-Ogievskiy
29.11.2016 15:57, Alex Bligh wrote:
> Vladimir,
>
> I went back to April to reread the previous train of conversation
> then found you had helpfully summarised some if it. Comments
> below.
>
> Rather than comment on many of the points individually, the root
> of my confusion and to some extent my uncomfortableness about this
> proposal is 'who owns the meaning of the bitmaps'.
>
> Some of this is my own confusion (sorry) about the use to which
> this is being put, which is I think at root a documentation issue.
> To illustrate this, you write in the FAQ section that this is for
> read only disks, but the text talks about:
>
> +Some storage formats and operations over such formats express a
> +concept of data dirtiness. Whether the operation is block device
> +mirroring, incremental block device backup or any other operation with
> +a concept of data dirtiness, they all share a need to provide a list
> +of ranges that this particular operation treats as dirty.
>
> How can data be 'dirty' if it is static and unchangeable? (I thought)
>
> I now think what you are talking about is backing up a *snapshot* of a disk
> that's running, where the disk itself was not connected using NBD? IE it's
> not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is effectively
> an opaque state represented in a bitmap, which is binary metadata
> at some particular level of granularity. It might as well be 'happiness'
> or 'is coloured blue'. The NBD server would (normally) have no way of
> manipulating this bitmap.

Yes, something like this.

Note: in Qemu it may not be a snapshot (actually, I didn't see a way in 
Qemu to export snapshots without switching to them, except by opening an 
external snapshot as a separate block device), but just any read-only 
drive, or temporary drive, used in the image fleecing scheme:

driveA is online normal drive
driveB is empty nbd export

- start backup driveA->driveB with sync=none, which means that the only 
copy which is done is copying old data from A to B before every write to A
- and set driveA as backing for driveB (on read from B, if data is not 
allocated it will be read from A)

after that, driveB is something like a snapshot for backup through NBD.
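
For concreteness, the fleecing setup described above roughly corresponds to a QMP sequence like the following. This is a sketch only: the node names ("driveA", "fleece") and filename are invented, and the exact argument spellings of blockdev-add / blockdev-backup / nbd-server-add vary between qemu versions.

```python
# Rough sketch of the image-fleecing QMP sequence described above.
# All node names are hypothetical; argument spellings follow qemu's QMP
# of roughly this era and may differ in other versions.
fleecing_sequence = [
    # 1. Add an empty temporary image whose backing node is the live
    #    drive, so reads of unallocated areas fall through to driveA.
    {"execute": "blockdev-add",
     "arguments": {"node-name": "fleece",
                   "driver": "qcow2",
                   "file": {"driver": "file", "filename": "fleece.qcow2"},
                   "backing": "driveA"}},
    # 2. sync=none: the only copying done is copy-before-write, i.e. old
    #    data is copied from A to the target before each write to A.
    {"execute": "blockdev-backup",
     "arguments": {"device": "driveA", "target": "fleece", "sync": "none"}},
    # 3. Export the temporary node over NBD; it now behaves like a
    #    point-in-time view of driveA for the backup client.
    {"execute": "nbd-server-start",
     "arguments": {"addr": {"type": "inet",
                            "data": {"host": "0.0.0.0", "port": "10809"}}}},
    {"execute": "nbd-server-add",
     "arguments": {"device": "fleece"}},
]
```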

It's all just to say: calling the backup's point-in-time state just a 
'snapshot' confuses me, and I hope not to see this word in the spec (in 
this context).

>
> In previous comments, I said 'how come we can set the dirty bit through
> writes but can't clear it?'. This (my statement) is now I think wrong,
> as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The
> state of the bitmap comes from whatever sets the bitmap which is outside
> the scope of this protocol to transmit it.
>
> However, we have the uncomfortable (to me) situation where the protocol
> describes a flag 'dirty', with implications as to what it does, but
> no actual strict definition of how it's set. So any 'other' user has
> no real idea how to use the information, or how to implement a server
> that provides a 'dirty' bit, because the semantics of that aren't within
> the protocol. This does not sit happily with me.
>
> So I'm wondering whether we should simplify and generalise this spec. You
> say that for the dirty flag, there's no specification of when it is
> set and cleared - that's implementation defined. Would it not be better
> then to say 'that whole thing is private to Qemu - even the name'.
>
> Rather you could read the list of bitmaps a server has, with a textual
> name, each having an index (and perhaps a granularity). You could then
> ask on NBD_CMD_BLOCK_STATUS for the appropriate index, and get back that
> bitmap value. Some names (e.g. 'ALLOCATED') could be defined in the spec,
> and some (e.g. ones beginning with 'X-') could be reserved for user
> usage. So you could use 'X-QEMU-DIRTY'. If you then change what your
> dirty bit means, you could use 'X-QEMU-DIRTY2' or similar, and not
> need a protocol change to support it.
>
> IE rather than looking at 'a way of reading the dirty bit', we could
> have this as a generic way of reading opaque bitmaps. Only one (allocation)
> might be given meaning to start off with, and it wouldn't be necessary
> for all servers to support that - i.e. you could support bitmap reading
> without having an ALLOCATION bitmap available.
>
> This spec would then be limited to the transmission of the bitmaps
> (remove the word 'dirty' entirely, except perhaps as an illustrative
> use case), and include only the definition of the allocation bitmap.

Good point. For Qcow2 we finally came to just "bitmaps", not "dirty 
bitmaps", to make it more general. There is a problem with allocation if 
we want to make it a subcase of bitmaps: allocation natively has two 
bits per block: zero and allocated. We can of course separate this into 
two bitmaps, but this would not match the classic get_block_status.

> Some more nits:
>
>> Also, the bit of NBD_FLAG_SEND_BLOCK_STATUS is changed to 9, as 8 is now
>> NBD_FLAG_CAN_MULTI_CONN in the master branch.

Re: [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-29 Thread Wouter Verhelst
Hi Vladimir,

On Tue, Nov 29, 2016 at 03:41:10PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Hi,
> 
> 29.11.2016 13:50, Wouter Verhelst wrote:
> 
> Hi,
> 
> On Mon, Nov 28, 2016 at 06:33:24PM +0100, Wouter Verhelst wrote:
> 
> However, I'm arguing that if we're going to provide information about
> snapshots, we should be able to properly refer to these snapshots from
> within an NBD context. My previous mail suggested adding a negotiation
> message that would essentially ask the server "tell me about the
> snapshots you know about", giving them an NBD identifier in the 
> process
> (accompanied by a "foreign" identifier that is decidedly *not* an NBD
> identifier and that could be used to match the NBD identifier to
> something implementation-defined). This would be read-only 
> information;
> the client cannot ask the server to create new snapshots. We can then
> later in the protocol refer to these snapshots by way of that NBD
> identifier.
> 
> To make this a bit more concrete, I've changed the proposal like so:
> 
[...]
> +# Allocation contexts
> +
> +Allocation context 0 is the basic "exists at all" allocation context. If
> +an extent is not allocated at allocation context 0, it MUST NOT be
> +listed as allocated at another allocation context. This supports sparse
> 
> 
> > allocated here means a range with the NBD_STATE_HOLE bit unset?

Eh, yes. I should clarify that a bit further (no time right now though,
but patches are certainly welcome).

> +file semantics on the server side. If a server has only one allocation
> +context (the default), then writing to an extent which is allocated in
> +that allocation context 0 MUST NOT fail with ENOSPC.
> +
> +For all other cases, this specification requires no specific semantics
> +of allocation contexts. Implementations could support allocation
> +contexts with semantics like the following:
> +
> +- Incremental snapshots; if a block is allocated in one allocation
> +  context, that implies that it is also allocated in the next level up.
> +- Various bits of data about the backend of the storage; e.g., if the
> +  storage is written on a RAID array, an allocation context could
> +  return information about the redundancy level of a given extent
> +- If the backend implements a write-through cache of some sort, or
> +  synchronises with other servers, an allocation context could state
> +  that an extent is "allocated" once it has reached permanent storage
> +  and/or is synchronized with other servers.
> +
> +The only requirement of an allocation context is that it MUST be
> +representable with the flags as defined for `NBD_CMD_BLOCK_STATUS`.
> +
> +Likewise, the syntax of query strings is not specified by this document.
> +
> +Server implementations SHOULD document their syntax for query strings
> +and semantics for resulting allocation contexts in a document like this
> +one.
> 
> 
> IMHO, redundant paragraph for this spec.

The "SHOULD document" one? I want that there at any rate, just to make
clear that it's probably a good thing to have it (but no, it's not part
of the formal protocol spec).

> +
>  ### Transmission phase
> 
>   Flag fields
> @@ -983,6 +1065,9 @@ valid may depend on negotiation during the handshake 
> phase.
> content chunk in reply.  MUST NOT be set unless the transmission
> flags include `NBD_FLAG_SEND_DF`.  Use of this flag MAY trigger an
> `EOVERFLOW` error chunk, if the request length is too large.
> +- bit 3, `NBD_CMD_FLAG_REQ_ONE`; valid during `NBD_CMD_BLOCK_STATUS`. If
> +  set, the client is interested in only one extent per allocation
> +  context.
> 
>  # Structured reply flags
> 
> @@ -1371,38 +1456,48 @@ adds a new `NBD_CMD_BLOCK_STATUS` command which 
> returns a list of
>  ranges with their respective states.  This extension is not available
>  unless the client also negotiates the `STRUCTURED_REPLY` extension.
> 
> -* `NBD_FLAG_SEND_BLOCK_STATUS`
> -
> -The server SHOULD set this transmission flag to 1 if structured
> -replies have been negotiated, and the `NBD_CMD_BLOCK_STATUS`
> -request is supported.
> -
>  * `NBD_REPLY_TYPE_BLOCK_STATUS`
> 
> -*length* MUST be a positive integer multiple of 8.  This reply
> +*length* MUST be 4 + (a positive integer multiple of 8).  This reply
>  represents a series of consecutive block descriptors where the sum
>  of the lengths of the descriptors MUST not be greater than the
> -length of the original request.  This chunk type MUST appear at most
> -once in a structured reply. Valid as a reply to
> +length of the original request. This chunk type MUST appear at most
> +once per allocation ID i

Re: [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-29 Thread Alex Bligh

> On 29 Nov 2016, at 10:50, Wouter Verhelst  wrote:
> 
> +- `NBD_OPT_ALLOC_CONTEXT` (10)
> +
> +Return a list of `NBD_REP_ALLOC_CONTEXT` replies, one per context,
> +followed by an `NBD_REP_ACK`. If a server replies to such a request
> +with no error message, clients MAY send NBD_CMD_BLOCK_STATUS
> +commands during the transmission phase.

I haven't read this in detail but this seems to me to be similar to
the idea I just posted (sorry - kept getting interrupted whilst writing
the email) re multiple bitmaps?

But the name 'ALLOC_CONTEXT' is a bit weird. Why not call it 'metadata
bitmaps' or 'metadata extents' or something? 'Metadata' seems right, as
it's data about data.

> +# Allocation contexts
> +
> +Allocation context 0 is the basic "exists at all" allocation context. If
> +an extent is not allocated at allocation context 0, it MUST NOT be
> +listed as allocated at another allocation context. This supports sparse
> +file semantics on the server side. If a server has only one allocation
> +context (the default), then writing to an extent which is allocated in
> +that allocation context 0 MUST NOT fail with ENOSPC.
> +
> +For all other cases, this specification requires no specific semantics
> +of allocation contexts. Implementations could support allocation
> +contexts with semantics like the following:
> +
> +- Incremental snapshots; if a block is allocated in one allocation
> +  context, that implies that it is also allocated in the next level up.
> +- Various bits of data about the backend of the storage; e.g., if the
> +  storage is written on a RAID array, an allocation context could
> +  return information about the redundancy level of a given extent
> +- If the backend implements a write-through cache of some sort, or
> +  synchronises with other servers, an allocation context could state
> +  that an extent is "allocated" once it has reached permanent storage
> +  and/or is synchronized with other servers.
> +
> +The only requirement of an allocation context is that it MUST be
> +representable with the flags as defined for `NBD_CMD_BLOCK_STATUS`.
> +
> +Likewise, the syntax of query strings is not specified by this document.
> +
> +Server implementations SHOULD document their syntax for query strings
> +and semantics for resulting allocation contexts in a document like this
> +one.
> +

But this seems strange. Whether something is 'allocated' seems orthogonal
to me to whether it needs backing up. Even the fact that it's been zeroed
(it's now a hole) might need backing up.

So don't we need multiple independent lists of extents? Of course a server
might *implement* them under the hood with separate bitmaps or one big
bitmap, or no bitmap at all (for instance using file extents on POSIX).



-- 
Alex Bligh







Re: [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-29 Thread Alex Bligh
Vladimir,

I went back to April to reread the previous train of conversation
then found you had helpfully summarised some of it. Comments
below.

Rather than comment on many of the points individually, the root
of my confusion, and to some extent my discomfort, about this
proposal is 'who owns the meaning of the bitmaps'.

Some of this is my own confusion (sorry) about the use to which
this is being put, which is I think at root a documentation issue.
To illustrate this, you write in the FAQ section that this is for
read only disks, but the text talks about:

+Some storage formats and operations over such formats express a
+concept of data dirtiness. Whether the operation is block device
+mirroring, incremental block device backup or any other operation with
+a concept of data dirtiness, they all share a need to provide a list
+of ranges that this particular operation treats as dirty.

How can data be 'dirty' if it is static and unchangeable? (I thought)

I now think what you are talking about is backing up a *snapshot* of a disk
that's running, where the disk itself was not connected using NBD? IE it's
not being 'made dirty' by NBD_CMD_WRITE etc. Rather 'dirtiness' is effectively
an opaque state represented in a bitmap, which is binary metadata
at some particular level of granularity. It might as well be 'happiness'
or 'is coloured blue'. The NBD server would (normally) have no way of
manipulating this bitmap.

In previous comments, I said 'how come we can set the dirty bit through
writes but can't clear it?'. This (my statement) is now I think wrong,
as NBD_CMD_WRITE etc. is not defined to set the dirty bit. The
state of the bitmap comes from whatever sets the bitmap which is outside
the scope of this protocol to transmit it.

However, we have the uncomfortable (to me) situation where the protocol
describes a flag 'dirty', with implications as to what it does, but
no actual strict definition of how it's set. So any 'other' user has
no real idea how to use the information, or how to implement a server
that provides a 'dirty' bit, because the semantics of that aren't within
the protocol. This does not sit happily with me.

So I'm wondering whether we should simplify and generalise this spec. You
say that for the dirty flag, there's no specification of when it is
set and cleared - that's implementation defined. Would it not be better
then to say 'that whole thing is private to Qemu - even the name'.

Rather you could read the list of bitmaps a server has, with a textual
name, each having an index (and perhaps a granularity). You could then
ask on NBD_CMD_BLOCK_STATUS for the appropriate index, and get back that
bitmap value. Some names (e.g. 'ALLOCATED') could be defined in the spec,
and some (e.g. ones beginning with 'X-') could be reserved for user
usage. So you could use 'X-QEMU-DIRTY'. If you then change what your
dirty bit means, you could use 'X-QEMU-DIRTY2' or similar, and not
need a protocol change to support it.
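
A sketch of that naming scheme: spec-defined names plus an 'X-' escape hatch for implementation-private bitmaps. All concrete names and granularities below are illustrative, not part of any spec.

```python
# Sketch of the proposed bitmap directory: each server-side bitmap gets
# a textual name, an index, and perhaps a granularity. Names beginning
# with "X-" are reserved for private use; anything else would have to be
# defined by the spec. "ALLOCATED" is the one name the spec might define.
SPEC_DEFINED = {"ALLOCATED"}

def validate_name(name):
    """A name is acceptable if the spec defines it or it is private."""
    return name in SPEC_DEFINED or name.startswith("X-")

bitmaps = [
    {"index": 0, "name": "ALLOCATED",    "granularity": 4096},
    {"index": 1, "name": "X-QEMU-DIRTY", "granularity": 65536},
]

assert all(validate_name(b["name"]) for b in bitmaps)
```

A renamed private bitmap (e.g. 'X-QEMU-DIRTY2') passes the same check, so no protocol change is needed when its meaning changes.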

IE rather than looking at 'a way of reading the dirty bit', we could
have this as a generic way of reading opaque bitmaps. Only one (allocation)
might be given meaning to start off with, and it wouldn't be necessary
for all servers to support that - i.e. you could support bitmap reading
without having an ALLOCATION bitmap available.

This spec would then be limited to the transmission of the bitmaps
(remove the word 'dirty' entirely, except perhaps as an illustrative
use case), and include only the definition of the allocation bitmap.

Some more nits:

> Also, the bit of NBD_FLAG_SEND_BLOCK_STATUS is changed to 9, as 8 is now
> NBD_FLAG_CAN_MULTI_CONN in the master branch.
> 
> And, finally, I've rebased this onto current state of
> extension-structured-reply branch (which itself should be rebased on
> master IMHO).

Each documentation branch should normally be branched off master unless
it depends on another extension (in which case it will be branched from that).
I haven't been rebasing them frequently as it can disrupt those working
on the branches. There's only really an issue around rebasing where you
depend on another branch.


> 2. Q: different granularities of dirty/allocated bitmaps. Any problems?
>   A: 1: server replies with status descriptors of any size, granularity
> is hidden from the client
>  2: dirty/allocated requests are separate and unrelated to each
> other, so their granularities are not intersecting

I'm OK with this, but note that you do actually mention a granularity
of sorts in the spec (512 bytes) - I think you should replace that
with the minimum block size.
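
Answer 1 above, where the server hides its bitmap granularity by replying with descriptors of any size, could be sketched like this. The 64 KiB granularity is purely illustrative, and `extents_for` is a hypothetical server-side helper, not part of any implementation.

```python
# Sketch: a server-side bitmap with fixed granularity is reported as
# coalesced variable-length descriptors, so the client never observes
# the granularity itself.
GRANULARITY = 64 * 1024  # illustrative

def extents_for(bitmap, offset, length):
    """bitmap: sequence of per-granule status values. Returns merged
    (offset, length, status) descriptors clipped to [offset, offset+length)."""
    end = offset + length
    out, pos = [], offset
    while pos < end:
        status = bitmap[pos // GRANULARITY]
        run_start = pos
        # Extend the run while adjacent granules share the same status.
        while pos < end and bitmap[pos // GRANULARITY] == status:
            pos = min(end, (pos // GRANULARITY + 1) * GRANULARITY)
        out.append((run_start, pos - run_start, status))
    return out
```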

> 3. Q: selecting of dirty bitmap to export
>   A: several variants:
>  1: id of bitmap is in flags field of request
>  pros: - simple
>  cons: - it's a hack. flags field is for other uses.
>- we'll have to map bitmap names to these "ids"
>  2: introduce extended nbd requests with variable length and exploit this
> feature for BLOCK_STATUS command, 

Re: [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-29 Thread Vladimir Sementsov-Ogievskiy

Hi,

29.11.2016 13:50, Wouter Verhelst wrote:

Hi,

On Mon, Nov 28, 2016 at 06:33:24PM +0100, Wouter Verhelst wrote:

However, I'm arguing that if we're going to provide information about
snapshots, we should be able to properly refer to these snapshots from
within an NBD context. My previous mail suggested adding a negotiation
message that would essentially ask the server "tell me about the
snapshots you know about", giving them an NBD identifier in the process
(accompanied by a "foreign" identifier that is decidedly *not* an NBD
identifier and that could be used to match the NBD identifier to
something implementation-defined). This would be read-only information;
the client cannot ask the server to create new snapshots. We can then
later in the protocol refer to these snapshots by way of that NBD
identifier.

To make this a bit more concrete, I've changed the proposal like so:

 From bc6d0df4156e670be7b6adea4f2813f44ffa7202 Mon Sep 17 00:00:00 2001
From: Wouter Verhelst 
Date: Tue, 29 Nov 2016 11:46:04 +0100
Subject: [PATCH] Update with allocation contexts etc

Signed-off-by: Wouter Verhelst 
---
  doc/proto.md | 210 +--
  1 file changed, 145 insertions(+), 65 deletions(-)

diff --git a/doc/proto.md b/doc/proto.md
index dfdd06d..fe7ae53 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -768,8 +768,6 @@ The field has the following format:
to that command to the client. In the absense of this flag, clients
SHOULD NOT multiplex their commands over more than one connection to
the export.
-- bit 9, `NBD_FLAG_SEND_BLOCK_STATUS`; defined by the experimental
-  `BLOCK_STATUS` extension; see below.
  
  Clients SHOULD ignore unknown flags.
  
@@ -871,6 +869,46 @@ of the newstyle negotiation.
  
  Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
  
+- `NBD_OPT_ALLOC_CONTEXT` (10)

+
+Return a list of `NBD_REP_ALLOC_CONTEXT` replies, one per context,
+followed by an `NBD_REP_ACK`. If a server replies to such a request
+with no error message, clients MAY send NBD_CMD_BLOCK_STATUS
+commands during the transmission phase.
+
+If the query string is syntactically invalid, the server SHOULD send
+`NBD_REP_ERR_INVALID`. If the query string is syntactically valid
+but finds no allocation contexts, the server MUST send a single
+reply of type `NBD_REP_ACK`.
+
+This option MUST NOT be requested unless structured replies have
+been negotiated first. If a client attempts to do so, a server
+SHOULD send `NBD_REP_ERR_INVALID`.
+
+Data:
+- 32 bits, type
+- String, query to select a subset of the available allocation
+  contexts. If this is not specified (i.e., length is 4 and no
+  command is sent), then the server MUST send all the allocation
+  contexts it knows about. If specified, this query string MUST
+  start with a name that uniquely identifies a server
+  implementation; e.g., the reference implementation that
+  accompanies this document would support query strings starting
+  with 'nbd-server:'
+
+The type may be one of:
+- `NBD_ALLOC_LIST_CONTEXT` (1): the list of allocation contexts
+  selected by the query string is returned to the client without
+  changing any state (i.e., this does not add allocation contexts
+  for further usage).
+- `NBD_ALLOC_ADD_CONTEXT` (2): the list of allocation contexts
+  selected by the query string is added to the list of existing
+  allocation contexts.
+- `NBD_ALLOC_DEL_CONTEXT` (3): the list of allocation contexts
+  selected by the query string is removed from the list of used
+  allocation contexts. Servers SHOULD NOT reuse existing allocation
+  context IDs.
+
   Option reply types
  
  These values are used in the "reply type" field, sent by the server

@@ -901,6 +939,18 @@ during option haggling in the fixed newstyle negotiation.
  
  Defined by the experimental `INFO` [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
  
+- `NBD_REP_ALLOC_CONTEXT` (4)

+
+A description of an allocation context. Data:
+
+- 32 bits, NBD allocation context ID. If the request was NOT of type
+  `NBD_ALLOC_LIST_CONTEXT`, this field MUST NOT be zero.
+- String, name of the allocation context. This is not required to be
+  a human-readable string, but it MUST be valid UTF-8 data.
+
+Allocation context ID 0 is implied, and always exists. It cannot be
+removed.
+
  There are a number of error reply types, all of which are denoted by
  having bit 31 set. All error replies MAY have some data set, in which
  case that data is an error message string suitable for display to the user.
@@ -938,15 +988,47 @@ case that data is an error message string suitable for 
display to the user.
  
  Defined by the experimental `INFO` [extension](https://github.com/Networ

Re: [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-29 Thread Vladimir Sementsov-Ogievskiy
29.11.2016 13:18, Kevin Wolf wrote:
> Am 27.11.2016 um 20:17 hat Wouter Verhelst geschrieben:
>>> 3. Q: selecting of dirty bitmap to export
>>> A: several variants:
>>>1: id of bitmap is in flags field of request
>>>pros: - simple
>>>cons: - it's a hack. flags field is for other uses.
>>>  - we'll have to map bitmap names to these "ids"
>>>2: introduce extended nbd requests with variable length and exploit 
>>> this
>>>   feature for BLOCK_STATUS command, specifying bitmap identifier.
>>>   pros: - looks like a true way
>>>   cons: - we have to create additional extension
>>> - possible we have to create a map,
>>>   { <=> }
> >>>3: external tool should select which bitmap to export. So, in case 
>>> of Qemu
>>>   it should be something like qmp command block-export-dirty-bitmap.
>>>   pros: - simple
>>> - we can extend it to behave like (2) later
>>>   cons: - additional qmp command to implement (possibly, the lesser 
>>> evil)
> >>>  note: Hmm, external tool can choose between allocated/dirty 
>>> data too,
>>> so, we can remove 'NBD_FLAG_STATUS_DIRTY' flag at all.
>> Downside of 3, though, is that it moves the definition of what the
>> different states mean outside of the NBD protocol (i.e., the protocol
>> messages are not entirely defined anymore, and their meaning depends on
>> the clients and servers in use).
> Another point to consider is that option 3 doesn't allow you to access
> two (or more) different bitmaps from the client without using the side
> channel all the time to switch back and forth between them and having to
> drain the request queue each time to avoid races.
>
> In general, if we have something that "looks like the true way", I'd
> advocate to choose this option. Experience tells that we'd regret
> anything simpler in a year or two, and then we'll have to do the real
> thing anyway, but still need to support the quick hack for
> compatibility.
>
>>> 5. Number of status descriptors, sent by server, should be restricted
>>> variants:
>>> 1: just allow server to restrict this as it wants (which was done in v3)
>>> 2: (not excluding 1). Client specifies somehow the maximum for number
>>>of descriptors.
>>>2.1: add command flag, which will request only one descriptor
>>> (otherwise, no restrictions from the client)
>>>2.2: again, introduce extended nbd requests, and add field to
>>> specify this maximum
>> I think having a flag which requests just one descriptor can be useful,
>> but I'm hesitant to add it unless it's actually going to be used; so in
>> other words, I'll leave the decision on that bit to you.
> The native qemu block layer interface returns the status of only one
> contiguous chunks, so the easiest way to implement the NBD block driver
> in qemu would always use that bit.
>
> On the other hand, it would be possible for the NBD block driver in qemu
> to cache the rest of the response internally and answer the next request
> from the cache instead of sending a new request over the network. Maybe
> that's what it should be doing anyway for good performance.
>
>>> Also, an idea on 2-4:
>>>
>>>  As we say, that dirtiness is unknown for NBD, and external tool
>>>  should specify, manage and understand, which data is actually
>>>  transmitted, why not just call it user_data and leave status field
>>>  of reply chunk unspecified in this case?
>>>
>>>  So, I propose one flag for NBD_CMD_BLOCK_STATUS:
>>>  NBD_FLAG_STATUS_USER. If it is clear, then behaviour is defined by
>>>  Eric's 'Block provisioning status' paragraph.  If it is set, we just
>>>  leave status field for some external... protocol? Who knows, what is
>>>  this user data.
>> Yes, this sounds like a reasonable approach.
> Makes sense to me, too.
>
> However, if we have use for a single NBD_FLAG_STATUS_USER bit, we also
> have use for a second one. If we go with one of the options where the
> bitmap is selected per command, we're fine because you can simply move
> the second bit to a different bitmap and do two requests. If we have
> only a single active bitmap at a time, though, this feels like an actual
> problem.
>
>>> Another idea, about backups themselves:
>>>
>>>  Why do we need allocated/zero status for backup? IMHO we don't.
>> Well, I've been thinking so all along, but then I don't really know what
>> it is, in detail, that you want to do :-)
> I think we do. A block can be dirtied by discarding/write_zero-ing it.
> Then we want the dirty bit to know that we need to include this block in
> the incremental backup, but we also want to know that we don't actually
> have to transfer the data in it.

And we will know that automatically, by using the structured read 
command, so a separate call to get_block_status is not needed.

>
> K
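
The client-side caching Kevin suggests could be sketched as follows; `fetch_status` is a hypothetical stand-in for an actual NBD_CMD_BLOCK_STATUS request, and the one-extent interface mirrors what the qemu block layer would consume.

```python
# Sketch: an NBD client that asks for many extents per request but
# exposes a one-chunk-at-a-time interface, answering follow-up queries
# from the cached remainder of the previous reply instead of issuing a
# new request over the network.
class StatusCache:
    def __init__(self, fetch_status):
        # fetch_status(offset, length) -> [(offset, length, status), ...]
        self.fetch = fetch_status
        self.cached = []

    def one_extent(self, offset, length):
        # Serve from cache if the cached extents continue exactly here;
        # otherwise discard stale entries and refetch.
        while self.cached:
            off, ln, st = self.cached[0]
            if off == offset:
                self.cached.pop(0)
                return (off, min(ln, length), st)
            self.cached.pop(0)
        self.cached = self.fetch(offset, length)
        off, ln, st = self.cached.pop(0)
        return (off, min(ln, length), st)
```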

Re: [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-29 Thread Wouter Verhelst
Hi,

On Mon, Nov 28, 2016 at 06:33:24PM +0100, Wouter Verhelst wrote:
> However, I'm arguing that if we're going to provide information about
> snapshots, we should be able to properly refer to these snapshots from
> within an NBD context. My previous mail suggested adding a negotiation
> message that would essentially ask the server "tell me about the
> snapshots you know about", giving them an NBD identifier in the process
> (accompanied by a "foreign" identifier that is decidedly *not* an NBD
> identifier and that could be used to match the NBD identifier to
> something implementation-defined). This would be read-only information;
> the client cannot ask the server to create new snapshots. We can then
> later in the protocol refer to these snapshots by way of that NBD
> identifier.

To make this a bit more concrete, I've changed the proposal like so:

From bc6d0df4156e670be7b6adea4f2813f44ffa7202 Mon Sep 17 00:00:00 2001
From: Wouter Verhelst 
Date: Tue, 29 Nov 2016 11:46:04 +0100
Subject: [PATCH] Update with allocation contexts etc

Signed-off-by: Wouter Verhelst 
---
 doc/proto.md | 210 +--
 1 file changed, 145 insertions(+), 65 deletions(-)

diff --git a/doc/proto.md b/doc/proto.md
index dfdd06d..fe7ae53 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -768,8 +768,6 @@ The field has the following format:
   to that command to the client. In the absense of this flag, clients
   SHOULD NOT multiplex their commands over more than one connection to
   the export.
-- bit 9, `NBD_FLAG_SEND_BLOCK_STATUS`; defined by the experimental
-  `BLOCK_STATUS` extension; see below.
 
 Clients SHOULD ignore unknown flags.
 
@@ -871,6 +869,46 @@ of the newstyle negotiation.
 
 Defined by the experimental `INFO` 
[extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
 
+- `NBD_OPT_ALLOC_CONTEXT` (10)
+
+Return a list of `NBD_REP_ALLOC_CONTEXT` replies, one per context,
+followed by an `NBD_REP_ACK`. If a server replies to such a request
+with no error message, clients MAY send NBD_CMD_BLOCK_STATUS
+commands during the transmission phase.
+
+If the query string is syntactically invalid, the server SHOULD send
+`NBD_REP_ERR_INVALID`. If the query string is syntactically valid
+but finds no allocation contexts, the server MUST send a single
+reply of type `NBD_REP_ACK`.
+
+This option MUST NOT be requested unless structured replies have
+been negotiated first. If a client attempts to do so, a server
+SHOULD send `NBD_REP_ERR_INVALID`.
+
+Data:
+- 32 bits, type
+- String, query to select a subset of the available allocation
+  contexts. If this is not specified (i.e., length is 4 and no
+  command is sent), then the server MUST send all the allocation
+  contexts it knows about. If specified, this query string MUST
+  start with a name that uniquely identifies a server
+  implementation; e.g., the reference implementation that
+  accompanies this document would support query strings starting
+  with 'nbd-server:'
+
+The type may be one of:
+- `NBD_ALLOC_LIST_CONTEXT` (1): the list of allocation contexts
+  selected by the query string is returned to the client without
+  changing any state (i.e., this does not add allocation contexts
+  for further usage).
+- `NBD_ALLOC_ADD_CONTEXT` (2): the list of allocation contexts
+  selected by the query string is added to the list of existing
+  allocation contexts.
+- `NBD_ALLOC_DEL_CONTEXT` (3): the list of allocation contexts
+  selected by the query string is removed from the list of used
+  allocation contexts. Servers SHOULD NOT reuse existing allocation
+  context IDs.
+
  Option reply types
 
 These values are used in the "reply type" field, sent by the server
@@ -901,6 +939,18 @@ during option haggling in the fixed newstyle negotiation.
 
 Defined by the experimental `INFO` 
[extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).
 
+- `NBD_REP_ALLOC_CONTEXT` (4)
+
+A description of an allocation context. Data:
+
+- 32 bits, NBD allocation context ID. If the request was NOT of type
+  `NBD_ALLOC_LIST_CONTEXT`, this field MUST NOT be zero.
+- String, name of the allocation context. This is not required to be
+  a human-readable string, but it MUST be valid UTF-8 data.
+
+Allocation context ID 0 is implied, and always exists. It cannot be
+removed.
+
 There are a number of error reply types, all of which are denoted by
 having bit 31 set. All error replies MAY have some data set, in which
 case that data is an error message string suitable for display to the user.
@@ -938,15 +988,47 @@ case that data is an error message string suitable for 
display to the user.
 
 Defined by the experimental `INFO` 
[extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md

Re: [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-29 Thread Kevin Wolf
Am 27.11.2016 um 20:17 hat Wouter Verhelst geschrieben:
> > 3. Q: selecting of dirty bitmap to export
> >A: several variants:
> >   1: id of bitmap is in flags field of request
> >   pros: - simple
> >   cons: - it's a hack. flags field is for other uses.
> > - we'll have to map bitmap names to these "ids"
> >   2: introduce extended nbd requests with variable length and exploit 
> > this
> >  feature for BLOCK_STATUS command, specifying bitmap identifier.
> >  pros: - looks like a true way
> >  cons: - we have to create additional extension
> >- possible we have to create a map,
> >  { <=> }
> >   3: external tool should select which bitmap to export. So, in case of 
> > Qemu
> >  it should be something like qmp command block-export-dirty-bitmap.
> >  pros: - simple
> >- we can extend it to behave like (2) later
> >  cons: - additional qmp command to implement (possibly, the lesser 
> > evil)
> >  note: Hmm, external tool can make the choice between allocated/dirty 
> > data too,
> >so, we can remove 'NBD_FLAG_STATUS_DIRTY' flag at all.
> 
> Downside of 3, though, is that it moves the definition of what the
> different states mean outside of the NBD protocol (i.e., the protocol
> messages are not entirely defined anymore, and their meaning depends on
> the clients and servers in use).

Another point to consider is that option 3 doesn't allow you to access
two (or more) different bitmaps from the client without using the side
channel all the time to switch back and forth between them and having to
drain the request queue each time to avoid races.

In general, if we have something that "looks like the true way", I'd
advocate to choose this option. Experience tells that we'd regret
anything simpler in a year or two, and then we'll have to do the real
thing anyway, but still need to support the quick hack for
compatibility.

> > 5. Number of status descriptors, sent by server, should be restricted
> >variants:
> >1: just allow server to restrict this as it wants (which was done in v3)
> >2: (not excluding 1). Client specifies somehow the maximum for number
> >   of descriptors.
> >   2.1: add command flag, which will request only one descriptor
> >(otherwise, no restrictions from the client)
> >   2.2: again, introduce extended nbd requests, and add field to
> >specify this maximum
> 
> I think having a flag which requests just one descriptor can be useful,
> but I'm hesitant to add it unless it's actually going to be used; so in
> other words, I'll leave the decision on that bit to you.

The native qemu block layer interface returns the status of only one
contiguous chunk, so the easiest way to implement the NBD block driver
in qemu would be to always use that bit.

On the other hand, it would be possible for the NBD block driver in qemu
to cache the rest of the response internally and answer the next request
from the cache instead of sending a new request over the network. Maybe
that's what it should be doing anyway for good performance.
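That caching idea could be sketched roughly as follows; all names here are illustrative, and this is not qemu's actual NBD driver code:

```python
from bisect import bisect_right

class ExtentCache:
    """Client-side cache of BLOCK_STATUS extents from the last reply.

    Keeps the extents the server already sent so that the next status
    query can often be answered locally instead of issuing another
    NBD_CMD_BLOCK_STATUS over the network. Extents are stored as
    non-overlapping (offset, length, flags) tuples sorted by offset.
    """

    def __init__(self):
        self._extents = []

    def fill(self, extents):
        """Replace the cache with the extents from a fresh server reply."""
        self._extents = sorted(extents)

    def lookup(self, offset):
        """Return (remaining_length, flags) covering offset, or None on a miss."""
        starts = [e[0] for e in self._extents]
        i = bisect_right(starts, offset) - 1
        if i >= 0:
            start, length, flags = self._extents[i]
            if offset < start + length:
                # Report the rest of the run from `offset` to the extent's end.
                return start + length - offset, flags
        return None  # miss: the driver would send a real request here
```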

> > Also, an idea on 2-4:
> > 
> > As we say, that dirtiness is unknown for NBD, and external tool
> > should specify, manage and understand, which data is actually
> > transmitted, why not just call it user_data and leave status field
> > of reply chunk unspecified in this case?
> > 
> > So, I propose one flag for NBD_CMD_BLOCK_STATUS:
> > NBD_FLAG_STATUS_USER. If it is clear, then behaviour is defined by
> > Eric's 'Block provisioning status' paragraph.  If it is set, we just
> > leave status field for some external... protocol? Who knows, what is
> > this user data.
> 
> Yes, this sounds like a reasonable approach.

Makes sense to me, too.

However, if we have use for a single NBD_FLAG_STATUS_USER bit, we also
have use for a second one. If we go with one of the options where the
bitmap is selected per command, we're fine because you can simply move
the second bit to a different bitmap and do two requests. If we have
only a single active bitmap at a time, though, this feels like an actual
problem.

> > Another idea, about backups themselves:
> > 
> > Why do we need allocated/zero status for backup? IMHO we don't.
> 
> Well, I've been thinking so all along, but then I don't really know what
> it is, in detail, that you want to do :-)

I think we do. A block can be dirtied by discarding/write_zero-ing it.
Then we want the dirty bit to know that we need to include this block in
the incremental backup, but we also want to know that we don't actually
have to transfer the data in it.
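That dirty-but-unallocated case amounts to a small per-extent decision; a sketch follows, where the flag names track the draft but the numeric values and the helper itself are assumptions for illustration:

```python
# Hypothetical flag values; names follow the draft text.
NBD_STATE_HOLE = 1 << 0   # extent not allocated in backend storage
NBD_STATE_ZERO = 1 << 1   # extent reads as zeroes
DIRTY = 1 << 0            # bit in the dirty-bitmap status

def backup_actions(extents):
    """Decide what an incremental backup must do for each extent.

    `extents` is a list of (offset, length, dirty_flags, alloc_flags).
    Dirty + hole/zero means: record the extent as discarded/zero in the
    backup, but skip transferring its payload over the network.
    """
    actions = []
    for offset, length, dirty, alloc in extents:
        if not dirty & DIRTY:
            actions.append((offset, length, "skip"))       # unchanged block
        elif alloc & (NBD_STATE_HOLE | NBD_STATE_ZERO):
            actions.append((offset, length, "mark-zero"))  # no data transfer
        else:
            actions.append((offset, length, "read"))       # NBD_CMD_READ it
    return actions
```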

Kevin

--
___
Nbd-general mailing list
Nbd-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nbd-general


Re: [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-29 Thread Stefan Hajnoczi
On Mon, Nov 28, 2016 at 06:33:24PM +0100, Wouter Verhelst wrote:
> Hi Stefan,
> 
> On Mon, Nov 28, 2016 at 11:19:44AM +, Stefan Hajnoczi wrote:
> > On Sun, Nov 27, 2016 at 08:17:14PM +0100, Wouter Verhelst wrote:
> > > Quickly: the reason I haven't merged this yet is twofold:
> > > - I wasn't thrilled with the proposal at the time. It felt a bit
> > >   hackish, and bolted onto NBD so you could use it, but without defining
> > >   everything in the NBD protocol. "We're reading some data, but it's not
> > >   about you". That didn't feel right
> > >
> > > - There were a number of questions still unanswered (you're answering a
> > >   few below, so that's good).
> > > 
> > > For clarity, I have no objection whatsoever to adding more commands if
> > > they're useful, but I would prefer that they're also useful with NBD on
> > > its own, i.e., without requiring an initiation or correlation of some
> > > state through another protocol or network connection or whatever. If
> > > that's needed, that feels like I didn't do my job properly, if you get
> > > my point.
> > 
> > The out-of-band operations you are referring to are for dirty bitmap
> > management.  (The goal is to read out blocks that changed since the last
> > backup.)
> > 
> > The client does not access the live disk, instead it accesses a
> > read-only snapshot and the dirty information (so that it can copy out
> > only blocks that were written).  The client is allowed to read blocks
> > that are not dirty too.
> 
> I understood as much, yes.
> 
> > If you want to implement the whole incremental backup workflow in NBD
> > then the client would first have to connect to the live disk, set up
> > dirty tracking, create a snapshot export, and then connect to that
> > snapshot.
> > 
> > That sounds like a big feature set and I'd argue it's for the control
> > plane (storage API) and not the data plane (NBD).  There were
> > discussions about transferring the dirty information via the control
> > plane but it seems more appropriate to it in the data plane since it is
> > block-level information.
> 
> I agree that creating and managing snapshots is out of scope for NBD. The
> protocol is not set up for that.
> 
> However, I'm arguing that if we're going to provide information about
> snapshots, we should be able to properly refer to these snapshots from
> within an NBD context. My previous mail suggested adding a negotiation
> message that would essentially ask the server "tell me about the
> snapshots you know about", giving them an NBD identifier in the process
> (accompanied by a "foreign" identifier that is decidedly *not* an NBD
> identifier and that could be used to match the NBD identifier to
> something implementation-defined). This would be read-only information;
> the client cannot ask the server to create new snapshots. We can then
> later in the protocol refer to these snapshots by way of that NBD
> identifier.
> 
> My proposal also makes it impossible to get updates of newly created
> snapshots without disconnecting and reconnecting (due to the fact that
> you can't go from transmission back to negotiation), but I'm not sure
> that's a problem.
> 
> Doing so has two advantages:
> - If a client is accidentally (due to misconfiguration or implementation
>   bugs or whatnot) connecting to the wrong server after having created a
>   snapshot through a management protocol, we have an opportunity to
>   detect this error, due to the fact that the "foreign" identifiers
>   passed to the client during negotiation will not match with what the
>   client was expecting.
> - A future version of the protocol could possibly include an extended
>   version of the read command, allowing a client to read information
>   from multiple storage snapshots without requiring a reconnect, and
>   allowing current clients information about allocation status across
>   various snapshots (although a first implementation could very well
>   limit itself to only having one snapshot).

Sorry, I misunderstood you.

Snapshots are not very different from NBD exports.  Especially if the
storage system supports writeable-snapshot (aka cloning).  Should we
just use named exports?

Stefan




Re: [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-28 Thread Wouter Verhelst
Hi Stefan,

On Mon, Nov 28, 2016 at 11:19:44AM +, Stefan Hajnoczi wrote:
> On Sun, Nov 27, 2016 at 08:17:14PM +0100, Wouter Verhelst wrote:
> > Quickly: the reason I haven't merged this yet is twofold:
> > - I wasn't thrilled with the proposal at the time. It felt a bit
> >   hackish, and bolted onto NBD so you could use it, but without defining
> >   everything in the NBD protocol. "We're reading some data, but it's not
> >   about you". That didn't feel right
> >
> > - There were a number of questions still unanswered (you're answering a
> >   few below, so that's good).
> > 
> > For clarity, I have no objection whatsoever to adding more commands if
> > they're useful, but I would prefer that they're also useful with NBD on
> > its own, i.e., without requiring an initiation or correlation of some
> > state through another protocol or network connection or whatever. If
> > that's needed, that feels like I didn't do my job properly, if you get
> > my point.
> 
> The out-of-band operations you are referring to are for dirty bitmap
> management.  (The goal is to read out blocks that changed since the last
> backup.)
> 
> The client does not access the live disk, instead it accesses a
> read-only snapshot and the dirty information (so that it can copy out
> only blocks that were written).  The client is allowed to read blocks
> that are not dirty too.

I understood as much, yes.

> If you want to implement the whole incremental backup workflow in NBD
> then the client would first have to connect to the live disk, set up
> dirty tracking, create a snapshot export, and then connect to that
> snapshot.
> 
> That sounds like a big feature set and I'd argue it's for the control
> plane (storage API) and not the data plane (NBD).  There were
> discussions about transferring the dirty information via the control
> plane but it seems more appropriate to it in the data plane since it is
> block-level information.

I agree that creating and managing snapshots is out of scope for NBD. The
protocol is not set up for that.

However, I'm arguing that if we're going to provide information about
snapshots, we should be able to properly refer to these snapshots from
within an NBD context. My previous mail suggested adding a negotiation
message that would essentially ask the server "tell me about the
snapshots you know about", giving them an NBD identifier in the process
(accompanied by a "foreign" identifier that is decidedly *not* an NBD
identifier and that could be used to match the NBD identifier to
something implementation-defined). This would be read-only information;
the client cannot ask the server to create new snapshots. We can then
later in the protocol refer to these snapshots by way of that NBD
identifier.

My proposal also makes it impossible to get updates of newly created
snapshots without disconnecting and reconnecting (due to the fact that
you can't go from transmission back to negotiation), but I'm not sure
that's a problem.

Doing so has two advantages:
- If a client is accidentally (due to misconfiguration or implementation
  bugs or whatnot) connecting to the wrong server after having created a
  snapshot through a management protocol, we have an opportunity to
  detect this error, due to the fact that the "foreign" identifiers
  passed to the client during negotiation will not match with what the
  client was expecting.
- A future version of the protocol could possibly include an extended
  version of the read command, allowing a client to read information
  from multiple storage snapshots without requiring a reconnect, and
  allowing current clients information about allocation status across
  various snapshots (although a first implementation could very well
  limit itself to only having one snapshot).

[...]
> I'm arguing that the NBD protocol doesn't need to support the
> incremental backup workflow since it's a complex control plane concept.
> 
> Being able to read dirty information via NBD is useful for other block
> backup applications, not just QEMU.  It could be used for syncing LVM
> volumes across machines, for example, if someone implements an NBD+LVM
> server.

Indeed, and I was considering adding a basic implementation to go with
the copy-on-write support in stock nbd-server, too.

> Another issue with adding control plane operations is that you need to
> begin considering privilege separation.  Should all NBD clients be able
> to initiate snapshots, dirty tracking, etc or is some kind of access
> control required to limit certain commands?  Not all clients require the
> same privileges and so they shouldn't have access to the same set of
> operations.

Sure, which is why I wasn't suggesting anything of the sorts :-)

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
   people in the world who think they really understand all of its rules,
   and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

---

Re: [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-28 Thread Stefan Hajnoczi
On Sun, Nov 27, 2016 at 08:17:14PM +0100, Wouter Verhelst wrote:
> Quickly: the reason I haven't merged this yet is twofold:
> - I wasn't thrilled with the proposal at the time. It felt a bit
>   hackish, and bolted onto NBD so you could use it, but without defining
>   everything in the NBD protocol. "We're reading some data, but it's not
>   about you". That didn't feel right
>
> - There were a number of questions still unanswered (you're answering a
>   few below, so that's good).
> 
> For clarity, I have no objection whatsoever to adding more commands if
> they're useful, but I would prefer that they're also useful with NBD on
> its own, i.e., without requiring an initiation or correlation of some
> state through another protocol or network connection or whatever. If
> that's needed, that feels like I didn't do my job properly, if you get
> my point.

The out-of-band operations you are referring to are for dirty bitmap
management.  (The goal is to read out blocks that changed since the last
backup.)

The client does not access the live disk, instead it accesses a
read-only snapshot and the dirty information (so that it can copy out
only blocks that were written).  The client is allowed to read blocks
that are not dirty too.

If you want to implement the whole incremental backup workflow in NBD
then the client would first have to connect to the live disk, set up
dirty tracking, create a snapshot export, and then connect to that
snapshot.

That sounds like a big feature set and I'd argue it's for the control
plane (storage API) and not the data plane (NBD).  There were
discussions about transferring the dirty information via the control
plane but it seems more appropriate to it in the data plane since it is
block-level information.

I'm arguing that the NBD protocol doesn't need to support the
incremental backup workflow since it's a complex control plane concept.

Being able to read dirty information via NBD is useful for other block
backup applications, not just QEMU.  It could be used for syncing LVM
volumes across machines, for example, if someone implements an NBD+LVM
server.

Another issue with adding control plane operations is that you need to
begin considering privilege separation.  Should all NBD clients be able
to initiate snapshots, dirty tracking, etc or is some kind of access
control required to limit certain commands?  Not all clients require the
same privileges and so they shouldn't have access to the same set of
operations.

Stefan




Re: [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-27 Thread Wouter Verhelst
Hi Vladimir,

Quickly: the reason I haven't merged this yet is twofold:
- I wasn't thrilled with the proposal at the time. It felt a bit
  hackish, and bolted onto NBD so you could use it, but without defining
  everything in the NBD protocol. "We're reading some data, but it's not
  about you". That didn't feel right
- There were a number of questions still unanswered (you're answering a
  few below, so that's good).

For clarity, I have no objection whatsoever to adding more commands if
they're useful, but I would prefer that they're also useful with NBD on
its own, i.e., without requiring an initiation or correlation of some
state through another protocol or network connection or whatever. If
that's needed, that feels like I didn't do my job properly, if you get
my point.

On Fri, Nov 25, 2016 at 02:28:16PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> With the availability of sparse storage formats, it is often needed
> to query status of a particular range and read only those blocks of
> data that are actually present on the block device.
> 
> To provide such information, the patch adds the BLOCK_STATUS
> extension with one new NBD_CMD_BLOCK_STATUS command, a new
> structured reply chunk format, and a new transmission flag.
> 
> There exists a concept of data dirtiness, which is required
> during, for example, incremental block device backup. To express
> this concept via NBD protocol, this patch also adds a flag to
> NBD_CMD_BLOCK_STATUS to request dirtiness information rather than
> provisioning information; however, with the current proposal, data
> dirtiness is only useful with additional coordination outside of
> the NBD protocol (such as a way to start and stop the server from
> tracking dirty sectors).  Future NBD extensions may add commands
> to control dirtiness through NBD.
> 
> Since the NBD protocol has no notion of block size, and to mimic the SCSI
> "GET LBA STATUS" command more closely, it has been chosen to return
> a list of extents in the response to the NBD_CMD_BLOCK_STATUS command,
> instead of a bitmap.
> 
> CC: Pavel Borzenkov 
> CC: Denis V. Lunev 
> CC: Wouter Verhelst 
> CC: Paolo Bonzini 
> CC: Kevin Wolf 
> CC: Stefan Hajnoczi 
> Signed-off-by: Eric Blake 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
> 
> v3:
> 
> Hi all. This is almost a resend of v2 (by Eric Blake). The only change is
> removing the restriction that the sum of status descriptor lengths must be
> equal to the requested length. I.e., let's permit the server to reply with
> less data than requested if it wants.

Reasonable, yes. The length that the client requests should be a maximum (i.e.
"I'm interested in this range"), not an exact request.
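A server-side sketch of that "length is a maximum" semantic (illustrative only, not the reference server's code; the descriptor cap is an arbitrary example of a server-chosen restriction):

```python
def clamp_extents(extents, req_offset, req_length, max_descriptors=16):
    """Trim a server's internal extent list to a client's status request.

    The client's length is treated as a maximum ("I'm interested in this
    range"); the server may also stop early, e.g. after max_descriptors
    status descriptors, replying with less data than requested.
    """
    end = req_offset + req_length
    out = []
    for offset, length, flags in extents:
        if offset + length <= req_offset or offset >= end:
            continue  # entirely outside the requested range
        start = max(offset, req_offset)
        stop = min(offset + length, end)
        out.append((start, stop - start, flags))
        if len(out) >= max_descriptors:
            break  # permitted: the reply covers less than requested
    return out
```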

> Also, the NBD_FLAG_SEND_BLOCK_STATUS bit is changed to 9, as 8 is now
> NBD_FLAG_CAN_MULTI_CONN in the master branch.

Right.

> And, finally, I've rebased this onto current state of
> extension-structured-reply branch (which itself should be rebased on
> master IMHO).

Probably a good idea, given the above.

> By this resend I just want to continue the discussion, started about half
> a year ago. Here is a summary of some questions and ideas from the v2
> discussion:
> 
> 1. Q: Synchronisation. Is such data (dirty/allocated) reliable? 
>A: This all is for read-only disks, so the data is static and unchangeable.

I think we should declare that it's up to the client to ensure no other
writes happen without its knowledge. This may be because the client and
server communicate out of band about state changes, or because the
client somehow knows that it's the only writer, or whatever.

We can easily do that by declaring that the result of that command only
talks about *current* state, and that concurrent writes by different
clients may invalidate the state. This is true for NBD in general (i.e.,
concurrent read or write commands from other clients may confuse file
systems on top of NBD), so it doesn't change expectations in any way.

> 2. Q: different granularities of dirty/allocated bitmaps. Any problems?
>A: 1: server replies with status descriptors of any size, granularity
>  is hidden from the client
>   2: dirty/allocated requests are separate and unrelated to each
>  other, so their granularities are not intersecting

Not entirely sure anymore what this is about?

> 3. Q: selecting of dirty bitmap to export
>A: several variants:
>   1: id of bitmap is in flags field of request
>   pros: - simple
>   cons: - it's a hack. flags field is for other uses.
> - we'll have to map bitmap names to these "ids"
>   2: introduce extended nbd requests with variable length and exploit this
>  feature for BLOCK_STATUS command, specifying bitmap identifier.
>  pros: - looks like a true way
>  cons: - we have to create additional extension
>- possible we have to create a map,
>  { <=> }
>   3: external tool should select which bitmap to export. So, in case of 
> Qemu
>   

Re: [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-25 Thread Stefan Hajnoczi
On Fri, Nov 25, 2016 at 02:28:16PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> With the availability of sparse storage formats, it is often needed
> to query status of a particular range and read only those blocks of
> data that are actually present on the block device.
> 
> To provide such information, the patch adds the BLOCK_STATUS
> extension with one new NBD_CMD_BLOCK_STATUS command, a new
> structured reply chunk format, and a new transmission flag.
> 
> There exists a concept of data dirtiness, which is required
> during, for example, incremental block device backup. To express
> this concept via NBD protocol, this patch also adds a flag to
> NBD_CMD_BLOCK_STATUS to request dirtiness information rather than
> provisioning information; however, with the current proposal, data
> dirtiness is only useful with additional coordination outside of
> the NBD protocol (such as a way to start and stop the server from
> tracking dirty sectors).  Future NBD extensions may add commands
> to control dirtiness through NBD.
> 
> Since the NBD protocol has no notion of block size, and to mimic the SCSI
> "GET LBA STATUS" command more closely, it has been chosen to return
> a list of extents in the response to the NBD_CMD_BLOCK_STATUS command,
> instead of a bitmap.
> 
> CC: Pavel Borzenkov 
> CC: Denis V. Lunev 
> CC: Wouter Verhelst 
> CC: Paolo Bonzini 
> CC: Kevin Wolf 
> CC: Stefan Hajnoczi 
> Signed-off-by: Eric Blake 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
> 
> v3:
> 
> Hi all. This is almost a resend of v2 (by Eric Blake). The only change is
> removing the restriction that the sum of status descriptor lengths must be
> equal to the requested length. I.e., let's permit the server to reply with
> less data than requested if it wants.
> 
> Also, the NBD_FLAG_SEND_BLOCK_STATUS bit is changed to 9, as 8 is now
> NBD_FLAG_CAN_MULTI_CONN in the master branch.
> 
> And, finally, I've rebased this onto current state of
> extension-structured-reply branch (which itself should be rebased on
> master IMHO).
> 
> By this resend I just want to continue the discussion, started about half
> a year ago. Here is a summary of some questions and ideas from the v2
> discussion:
> 
> 1. Q: Synchronisation. Is such data (dirty/allocated) reliable? 
>A: This all is for read-only disks, so the data is static and unchangeable.
> 
> 2. Q: different granularities of dirty/allocated bitmaps. Any problems?
>A: 1: server replies with status descriptors of any size, granularity
>  is hidden from the client
>   2: dirty/allocated requests are separate and unrelated to each
>  other, so their granularities are not intersecting
> 
> 3. Q: selecting of dirty bitmap to export
>A: several variants:
>   1: id of bitmap is in flags field of request
>   pros: - simple
>   cons: - it's a hack. flags field is for other uses.
> - we'll have to map bitmap names to these "ids"
>   2: introduce extended nbd requests with variable length and exploit this
>  feature for BLOCK_STATUS command, specifying bitmap identifier.
>  pros: - looks like a true way
>  cons: - we have to create additional extension
>- possible we have to create a map,
>  { <=> }
>   3: external tool should select which bitmap to export. So, in case of 
> Qemu
>  it should be something like qmp command block-export-dirty-bitmap.
>  pros: - simple
>- we can extend it to behave like (2) later
>  cons: - additional qmp command to implement (possibly, the lesser 
> evil)
>  note: Hmm, external tool can make the choice between allocated/dirty data 
> too,
>so, we can remove 'NBD_FLAG_STATUS_DIRTY' flag at all.
> 
> 4. Q: Should not get_{allocated,dirty} be separate commands?
>cons: Two commands with almost same semantic and similar means?
>pros: However here is a good point of separating clearly defined and native
>  for block devices GET_BLOCK_STATUS from user-driven and actually
>  undefined data, called 'dirtyness'.
> 
> 5. Number of status descriptors, sent by server, should be restricted
>variants:
>1: just allow server to restrict this as it wants (which was done in v3)
>2: (not excluding 1). Client specifies somehow the maximum for number
>   of descriptors.
>   2.1: add command flag, which will request only one descriptor
>(otherwise, no restrictions from the client)
>   2.2: again, introduce extended nbd requests, and add field to
>specify this maximum
> 
> 6. Q: What to do with unspecified flags (in request/reply)?
>I think the normal variant is to make them reserved. (Server should
>return EINVAL if it finds unknown bits, client should consider a reply
>with unknown bits as an error)
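Point 6 could look roughly like this on the server side; the KNOWN_FLAGS mask and the helper are assumptions for illustration, not part of the proposal:

```python
import errno

# Hypothetical mask of the request flags this server understands.
KNOWN_FLAGS = 0x0003

def check_flags(flags):
    """Treat unspecified flag bits as reserved.

    The server rejects requests carrying unknown bits with EINVAL;
    a client would likewise treat unknown bits in a reply as an error.
    """
    if flags & ~KNOWN_FLAGS:
        return errno.EINVAL  # a reserved (unknown) bit is set
    return 0
```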
> 
> ==
> 
> Also, an idea on 2-4:
> 
> As we say, that dirtiness is unknown for NBD, and external tool
> should specify, manage and und