Re: [PATCH] spec: Add NBD_OPT_EXTENDED_HEADERS

2022-10-04 Thread Eric Blake
On Fri, Dec 03, 2021 at 05:14:34PM -0600, Eric Blake wrote:
> Add a new negotiation feature where the client and server agree to use
> larger packet headers on every packet sent during transmission phase.
> This has two purposes: first, it makes it possible to perform
> operations like trim, write zeroes, and block status on more than 2^32
> bytes in a single command; this in turn requires that some structured
> replies from the server also be extended to match.  The wording chosen
> here is careful to permit a server to use either flavor in its reply
> (that is, a request less than 32-bits can trigger an extended reply,
> and conversely a request larger than 32-bits can trigger a compact
> reply).

Following up on this original proposal with something that came out of
KVM Forum this year.

> +* `NBD_REPLY_TYPE_BLOCK_STATUS_EXT` (6)
> +
> +  This chunk type is in the status chunk category.  *length* MUST be
> +  4 + (a positive multiple of 16).  The semantics of this chunk mirror
> +  those of `NBD_REPLY_TYPE_BLOCK_STATUS`, other than the use of a
> +  larger *extent length* field, as well as added padding to ease
> +  alignment.  This chunk type MUST NOT be used unless extended headers
> +  were negotiated with `NBD_OPT_EXTENDED_HEADERS`.
> +
> +  The payload starts with:
> +
> +  32 bits, metadata context ID  
> +
> +  and is followed by a list of one or more descriptors, each with this
> +  layout:
> +
> +  64 bits, length of the extent to which the status below
> + applies (unsigned, MUST be nonzero)  
> +  32 bits, status flags  
> +  32 bits, padding (MUST be zero)

During KVM Forum, I had several conversations about Zoned Block
Devices (https://zonedstorage.io/docs/linux/zbd-api), and what it
would take to expose ZBD information over NBD.  In particular,
NBD_CMD_BLOCK_STATUS sounds like a great way for advertising
information about zones (by adding several metadata contexts that can
be negotiated during NBD_OPT_SET_META_CONTEXT), except for the fact
that a zone might be larger than 32 bits in size.  So Rich Jones asked
me the question of whether my work on 64-bit extensions to the NBD
protocol could also allow for a server to advertise a metadata context
only to clients that support 64-bit extensions, at which point it can
report 64-bit offsets or lengths as needed, rather than being limited
to 32-bit status flags.

The idea definitely has merit, so I'm working on incorporating that
into my next revision for 64-bit extensions in NBD.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Re: [PATCH] spec: Add NBD_OPT_EXTENDED_HEADERS

2022-03-24 Thread Eric Blake
[Updating Vladimir's new preferred address in cc list]

On Thu, Mar 24, 2022 at 07:31:48PM +0200, Wouter Verhelst wrote:
> Hi Eric,
> 
> Thanks for the ping; it had slipped my mind.
> 
> On Fri, Dec 03, 2021 at 05:14:34PM -0600, Eric Blake wrote:
> >   Request message
> > 
> > -The request message, sent by the client, looks as follows:
> > +The compact request message, sent by the client when extended
> > +transactions are not negotiated using `NBD_OPT_EXTENDED_HEADERS`,
> > +looks as follows:
> > 
> >  C: 32 bits, 0x25609513, magic (`NBD_REQUEST_MAGIC`)  
> >  C: 16 bits, command flags  
> > @@ -353,14 +370,26 @@ C: 64 bits, offset (unsigned)
> >  C: 32 bits, length (unsigned)  
> >  C: (*length* bytes of data if the request is of type `NBD_CMD_WRITE`)  
> > 
> > +If negotiation agreed on extended transactions with
> > +`NBD_OPT_EXTENDED_HEADERS`, the client instead uses extended requests:
> > +
> > +C: 32 bits, 0x21e41c71, magic (`NBD_REQUEST_EXT_MAGIC`)  
> > +C: 16 bits, command flags  
> > +C: 16 bits, type  
> > +C: 64 bits, handle  
> > +C: 64 bits, offset (unsigned)  
> > +C: 64 bits, length (unsigned)  
> > +C: (*length* bytes of data if the request is of type `NBD_CMD_WRITE`)  
> > +
> 
> Perhaps we should decouple the ideas of "effect length" and "payload
> length"? As in,
> 
> C: 32 bits, 0x21e41c71, magic (`NBD_REQUEST_EXT_MAGIC`)
> C: 16 bits, command flags
> C: 16 bits, type
> C: 64 bits, handle
> C: 64 bits, offset
> C: 64 bits, effect length
> C: 64 bits, payload length
> C: (*payload length* bytes of data)
> 
> This makes the protocol more context free. With the current set of
> commands, only NBD_CMD_WRITE would have payload length be nonzero, but
> that doesn't have to remain the case forever; e.g., we could have a
> command that extends NBD_CMD_BLOCK_STATUS to only query a subset of the
> metadata contexts that we declared (if that is wanted, of course).
> 
> Of course, that does have the annoying side effect of no longer fitting
> in 32 bytes, requiring a 40-byte header instead. I think it would be
> worth it though.

Could we still keep a 32-byte header, by having a new command (or new
command flag to the existing NBD_CMD_BLOCK_STATUS), such that the
payload itself contains the needed extra bytes?

Hmm - right now, the only command with a payload is NBD_CMD_WRITE, and
all other commands use the length field as an effect length.  So maybe
what we do is have a single command flag that says whether the length
field is serving as payload length or as effect length.  NBD_CMD_WRITE
would always set the new flag (if extended headers were negotiated),
and most other NBD_CMD_* would leave the flag unset, but in the case
of BLOCK_STATUS wanting only a subset of id status reported, we could
then have:

HEADER:
C: 32 bits, 0x21e41c71, magic (`NBD_REQUEST_EXT_MAGIC`)
C: 16 bits, command flags, NBD_CMD_FLAG_PAYLOAD
C: 16 bits, type, NBD_CMD_BLOCK_STATUS
C: 64 bits, handle
C: 64 bits, offset
C: 64 bits, payload length = 12 + 4*n
PAYLOAD:
C: 64 bits, effect length (hint on desired range)
C: 32 bits, number of ids = n
C: 32 bits, id[0]
...
C: 32 bits, id[n-1]

vs.

HEADER:
C: 32 bits, 0x21e41c71, magic (`NBD_REQUEST_EXT_MAGIC`)
C: 16 bits, command flags, 0
C: 16 bits, type, NBD_CMD_BLOCK_STATUS
C: 64 bits, handle
C: 64 bits, offset
C: 64 bits, effect length (hint on desired range)

HEADER:
C: 32 bits, 0x21e41c71, magic (`NBD_REQUEST_EXT_MAGIC`)
C: 16 bits, command flags, NBD_CMD_FLAG_PAYLOAD
C: 16 bits, type, NBD_CMD_WRITE
C: 64 bits, handle
C: 64 bits, offset
C: 64 bits, payload length = n
PAYLOAD:
C: n*8 bits data


> 
> (This is obviously not relevant for reply messages, only for request
> messages)

Staying at a power of 2 may still be worth the expense of a new cmd
flag which must always be set for writes when extended headers are in
use.

> 
> >   Simple reply message
> > 
> >  The simple reply message MUST be sent by the server in response to all
> >  requests if structured replies have not been negotiated using
> > -`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been negotiated, a 
> > simple
> > -reply MAY be used as a reply to any request other than `NBD_CMD_READ`,
> > -but only if the reply has no data payload.  The message looks as
> > -follows:
> > +`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been
> > +negotiated, a simple reply MAY be used as a reply to any request other
> > +than `NBD_CMD_READ`, but only if the reply has no data payload.  If
> > +extended headers were not negotiated using `NBD_OPT_EXTENDED_HEADERS`,
> > +the message looks as follows:
> > 
> >  S: 32 bits, 0x67446698, magic (`NBD_SIMPLE_REPLY_MAGIC`; used to be
> > `NBD_REPLY_MAGIC`)  
> > @@ -369,6 +398,16 @@ S: 64 bits, handle
> >  S: (*length* bytes of data if the request is of type `NBD_CMD_READ` and
> >  *error* is zero)  
> > 
> > +If extended headers were negotiated using `NBD_OPT_EXTENDED_HEADERS`,
> > +the message looks like:
> > +
> > +S: 32 bits, 0x60d12fd6, magic 

Re: [PATCH] spec: Add NBD_OPT_EXTENDED_HEADERS

2022-03-24 Thread Wouter Verhelst
Hi Eric,

Thanks for the ping; it had slipped my mind.

On Fri, Dec 03, 2021 at 05:14:34PM -0600, Eric Blake wrote:
>   Request message
> 
> -The request message, sent by the client, looks as follows:
> +The compact request message, sent by the client when extended
> +transactions are not negotiated using `NBD_OPT_EXTENDED_HEADERS`,
> +looks as follows:
> 
>  C: 32 bits, 0x25609513, magic (`NBD_REQUEST_MAGIC`)  
>  C: 16 bits, command flags  
> @@ -353,14 +370,26 @@ C: 64 bits, offset (unsigned)
>  C: 32 bits, length (unsigned)  
>  C: (*length* bytes of data if the request is of type `NBD_CMD_WRITE`)  
> 
> +If negotiation agreed on extended transactions with
> +`NBD_OPT_EXTENDED_HEADERS`, the client instead uses extended requests:
> +
> +C: 32 bits, 0x21e41c71, magic (`NBD_REQUEST_EXT_MAGIC`)  
> +C: 16 bits, command flags  
> +C: 16 bits, type  
> +C: 64 bits, handle  
> +C: 64 bits, offset (unsigned)  
> +C: 64 bits, length (unsigned)  
> +C: (*length* bytes of data if the request is of type `NBD_CMD_WRITE`)  
> +

Perhaps we should decouple the ideas of "effect length" and "payload
length"? As in,

C: 32 bits, 0x21e41c71, magic (`NBD_REQUEST_EXT_MAGIC`)
C: 16 bits, command flags
C: 16 bits, type
C: 64 bits, handle
C: 64 bits, offset
C: 64 bits, effect length
C: 64 bits, payload length
C: (*payload length* bytes of data)

This makes the protocol more context free. With the current set of
commands, only NBD_CMD_WRITE would have payload length be nonzero, but
that doesn't have to remain the case forever; e.g., we could have a
command that extends NBD_CMD_BLOCK_STATUS to only query a subset of the
metadata contexts that we declared (if that is wanted, of course).

Of course, that does have the annoying side effect of no longer fitting
in 32 bytes, requiring a 40-byte header instead. I think it would be
worth it though.

(This is obviously not relevant for reply messages, only for request
messages)

>   Simple reply message
> 
>  The simple reply message MUST be sent by the server in response to all
>  requests if structured replies have not been negotiated using
> -`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been negotiated, a 
> simple
> -reply MAY be used as a reply to any request other than `NBD_CMD_READ`,
> -but only if the reply has no data payload.  The message looks as
> -follows:
> +`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been
> +negotiated, a simple reply MAY be used as a reply to any request other
> +than `NBD_CMD_READ`, but only if the reply has no data payload.  If
> +extended headers were not negotiated using `NBD_OPT_EXTENDED_HEADERS`,
> +the message looks as follows:
> 
>  S: 32 bits, 0x67446698, magic (`NBD_SIMPLE_REPLY_MAGIC`; used to be
> `NBD_REPLY_MAGIC`)  
> @@ -369,6 +398,16 @@ S: 64 bits, handle
>  S: (*length* bytes of data if the request is of type `NBD_CMD_READ` and
>  *error* is zero)  
> 
> +If extended headers were negotiated using `NBD_OPT_EXTENDED_HEADERS`,
> +the message looks like:
> +
> +S: 32 bits, 0x60d12fd6, magic (`NBD_SIMPLE_REPLY_EXT_MAGIC`)  
> +S: 32 bits, error (MAY be zero)  
> +S: 64 bits, handle  
> +S: 128 bits, padding (MUST be zero)  

Should all these requirements about padding not be a SHOULD rather than
a MUST?

[...]
> +* `NBD_OPT_EXTENDED_HEADERS` (11)
> +
> +The client wishes to use extended headers during the transmission
> +phase.  The client MUST NOT send any additional data with the
> +option, and the server SHOULD reject a request that includes data
> +with `NBD_REP_ERR_INVALID`.
> +
> +The server replies with the following, or with an error permitted
> +elsewhere in this document:
> +
> +- `NBD_REP_ACK`: Extended headers have been negotiated; the client
> +  MUST use the 32-byte extended request header, and the server
> +  MUST use the 32-byte extended reply header.
> +- For backwards compatibility, clients SHOULD be prepared to also
> +  handle `NBD_REP_ERR_UNSUP`; in this case, only the compact
> +  transmission headers will be used.
> +
> +If the client requests `NBD_OPT_STARTTLS` after this option, it
> +MUST renegotiate extended headers.
> +

Two thoughts here:

- We should probably allow NBD_REP_ERR_BLOCK_SIZE_REQD as a reply to
  this message; I could imagine a server might not want to talk 64-bit
  lengths if it doesn't know that block sizes are going to be
  reasonable.
- In the same vein, should we perhaps also add an error message for when
  extended headers are negotiated without structured replies? Perhaps a
  server implementation might not want to implement the "extended
  headers but no structured replies" message format.

On that note, while I know I had said earlier that I would prefer not
making this new extension depend on structured replies, in hindsight
perhaps it *is* a good idea to add that dependency; otherwise we create
an extra message format that is really a degenerate case of "we want to
be modern in one way but not in 

Re: [PATCH] spec: Add NBD_OPT_EXTENDED_HEADERS

2022-03-22 Thread Eric Blake
On Tue, Dec 07, 2021 at 06:14:23PM +0200, Wouter Verhelst wrote:
> On Mon, Dec 06, 2021 at 05:00:47PM -0600, Eric Blake wrote:
> > On Mon, Dec 06, 2021 at 02:40:45PM +0300, Vladimir Sementsov-Ogievskiy 
> > wrote:
> > > >    Simple reply message
> > > > 
> > > >   The simple reply message MUST be sent by the server in response to all
> > > >   requests if structured replies have not been negotiated using
> > > > -`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been 
> > > > negotiated, a simple
> > > > -reply MAY be used as a reply to any request other than `NBD_CMD_READ`,
> > > > -but only if the reply has no data payload.  The message looks as
> > > > -follows:
> > > > +`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been
> > > > +negotiated, a simple reply MAY be used as a reply to any request other
> > > > +than `NBD_CMD_READ`, but only if the reply has no data payload.  If
> > > > +extended headers were not negotiated using `NBD_OPT_EXTENDED_HEADERS`,
> > > > +the message looks as follows:
> > > > 
> > > >   S: 32 bits, 0x67446698, magic (`NBD_SIMPLE_REPLY_MAGIC`; used to be
> > > >  `NBD_REPLY_MAGIC`)
> > > > @@ -369,6 +398,16 @@ S: 64 bits, handle
> > > >   S: (*length* bytes of data if the request is of type `NBD_CMD_READ` 
> > > > and
> > > >   *error* is zero)
> > > > 
> > > > +If extended headers were negotiated using `NBD_OPT_EXTENDED_HEADERS`,
> > > > +the message looks like:
> > > > +
> > > > +S: 32 bits, 0x60d12fd6, magic (`NBD_SIMPLE_REPLY_EXT_MAGIC`)
> > > > +S: 32 bits, error (MAY be zero)
> > > > +S: 64 bits, handle
> > > > +S: 128 bits, padding (MUST be zero)
> > > > +S: (*length* bytes of data if the request is of type `NBD_CMD_READ` and
> > > > +*error* is zero)
> > > > +
> > > 
> > > If we go this way, let's put payload length into padding: it will help to 
> > > make the protocol context-independent and less error-prone.
> 
> Agreed.
> 
> > Easy enough to do (the payload length will be 0 except for
> > NBD_CMD_READ).
> 
> Indeed.
> 
> > > Or, the otherway, may be just forbid the payload for simple-64bit ? 
> > > What's the reason to allow 64bit requests without structured reply 
> > > negotiation?
> > 
> > The two happened to be orthogonal enough in my implementation.  It was
> > easy to demonstrate either one without the other, and it IS easier to
> > write a client using non-structured replies (structured reads ARE
> > tougher than simple reads, even if it is less efficient when it comes
> > to reading zeros).  But you are also right that we could require
> > structured reads prior to allowing 64-bit operations, and then have
> > only one supported reply type on the wire when negotiated.  Wouter,
> > which way do you prefer?
> 
> Given that I *still* haven't gotten around to implementing structured
> replies for nbd-server, I'd prefer not to require it, but that's not
> really a decent argument IMO :-)
> 
> [... I haven't read this in much detail yet, intend to do that later...]

Ping - any other responses on this thread, before I start working on
version 2 of the cross-project patches?

And repeating a comment from my original cover letter:

> with 64-bit commands, we may want to also make it easier to let
> servers advertise an actual maximum size they are willing to accept
> for the commands in question (for example, a server may be happy with
> a full 64-bit block status, but still want to limit non-fast zero and
> cache to a smaller limit to avoid denial of service).

Is it worth enhancing NBD_OPT_INFO in my v2?

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Re: [PATCH] spec: Add NBD_OPT_EXTENDED_HEADERS

2021-12-10 Thread Vladimir Sementsov-Ogievskiy

04.12.2021 02:14, Eric Blake wrote:

Add a new negotiation feature where the client and server agree to use
larger packet headers on every packet sent during transmission phase.
This has two purposes: first, it makes it possible to perform
operations like trim, write zeroes, and block status on more than 2^32
bytes in a single command; this in turn requires that some structured
replies from the server also be extended to match.  The wording chosen
here is careful to permit a server to use either flavor in its reply
(that is, a request less than 32-bits can trigger an extended reply,
and conversely a request larger than 32-bits can trigger a compact
reply).



About this.. Isn't it too permissive?

I think that actually having to very similar ways to do the same thing is 
usually a bad design. I think we don't want someone implement the logic, which 
tries to send 32bit commands/replies for small requests and 64bit 
command/replies for larger ones? Moreover, you don't allow doing it for 
commands. So, for symmetry, it may be good to be strict with replies too: in 
64bit mode only 64bit replies.

Now we of course have to support old 32bit commands and new 64bit commands. 
But, may be, we'll want to deprecate 32bit commands at some moment? I'm not 
sure we can deprecate them in protocol, but we can deprecate them in Qemu at 
least. And several years later we'll drop old code, keeping only support for 
64bit commands. Less code paths, less similar structures, simpler code, I think 
it worth it.


--
Best regards,
Vladimir



Re: [PATCH] spec: Add NBD_OPT_EXTENDED_HEADERS

2021-12-10 Thread Vladimir Sementsov-Ogievskiy

07.12.2021 12:08, Vladimir Sementsov-Ogievskiy wrote:

07.12.2021 02:00, Eric Blake wrote:

On Mon, Dec 06, 2021 at 02:40:45PM +0300, Vladimir Sementsov-Ogievskiy wrote:


[..]




+S: 64 bits, padding (MUST be zero)


Hmm. Extra 8 bytes to be power-of-2. Does 32 bytes really perform better than 
24 bytes?


Consider:
struct header[100];

if sizeof(header[0]) is a power of 2 <= the cache line size (and the
compiler prefers to start arrays aligned to the cache line) then we
are guaranteed that all array members each reside in a single cache
line.  But if it is not a power of 2, some of the array members
straddle two cache lines.

Will there be code that wants to create an array of headers?  Perhaps
so, because that is a logical way (along with scatter/gather to
combine the header with variable-sized payloads) of tracking the
headers for multiple commands issued in parallel.

Do I have actual performance numbers?  No. But there's plenty of
google hits for why sizing structs to a power of 2 is a good idea.


I have a thought:

If client stores headers in separate, nothing prevents make this padding in 
RAM. So you can define header struct with padding. But what a reason to make 
the padding in the stream? You can have and array of good-aligned structures, 
but fill only part of header structure reading from the socket. Note, that you 
can read only one header in one read() call anyway, as you have to analyze, 
does it have payload or not.

So, if we want to improve performance by padding the structures in RAM, it's 
not a reason for padding them in the wire, keeping in mind that we can't read 
more then one structure at once.


--
Best regards,
Vladimir



Re: [PATCH] spec: Add NBD_OPT_EXTENDED_HEADERS

2021-12-07 Thread Wouter Verhelst
On Mon, Dec 06, 2021 at 05:00:47PM -0600, Eric Blake wrote:
> On Mon, Dec 06, 2021 at 02:40:45PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> > >    Simple reply message
> > > 
> > >   The simple reply message MUST be sent by the server in response to all
> > >   requests if structured replies have not been negotiated using
> > > -`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been negotiated, 
> > > a simple
> > > -reply MAY be used as a reply to any request other than `NBD_CMD_READ`,
> > > -but only if the reply has no data payload.  The message looks as
> > > -follows:
> > > +`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been
> > > +negotiated, a simple reply MAY be used as a reply to any request other
> > > +than `NBD_CMD_READ`, but only if the reply has no data payload.  If
> > > +extended headers were not negotiated using `NBD_OPT_EXTENDED_HEADERS`,
> > > +the message looks as follows:
> > > 
> > >   S: 32 bits, 0x67446698, magic (`NBD_SIMPLE_REPLY_MAGIC`; used to be
> > >  `NBD_REPLY_MAGIC`)
> > > @@ -369,6 +398,16 @@ S: 64 bits, handle
> > >   S: (*length* bytes of data if the request is of type `NBD_CMD_READ` and
> > >   *error* is zero)
> > > 
> > > +If extended headers were negotiated using `NBD_OPT_EXTENDED_HEADERS`,
> > > +the message looks like:
> > > +
> > > +S: 32 bits, 0x60d12fd6, magic (`NBD_SIMPLE_REPLY_EXT_MAGIC`)
> > > +S: 32 bits, error (MAY be zero)
> > > +S: 64 bits, handle
> > > +S: 128 bits, padding (MUST be zero)
> > > +S: (*length* bytes of data if the request is of type `NBD_CMD_READ` and
> > > +*error* is zero)
> > > +
> > 
> > If we go this way, let's put payload length into padding: it will help to 
> > make the protocol context-independent and less error-prone.

Agreed.

> Easy enough to do (the payload length will be 0 except for
> NBD_CMD_READ).

Indeed.

> > Or, the otherway, may be just forbid the payload for simple-64bit ? What's 
> > the reason to allow 64bit requests without structured reply negotiation?
> 
> The two happened to be orthogonal enough in my implementation.  It was
> easy to demonstrate either one without the other, and it IS easier to
> write a client using non-structured replies (structured reads ARE
> tougher than simple reads, even if it is less efficient when it comes
> to reading zeros).  But you are also right that we could require
> structured reads prior to allowing 64-bit operations, and then have
> only one supported reply type on the wire when negotiated.  Wouter,
> which way do you prefer?

Given that I *still* haven't gotten around to implementing structured
replies for nbd-server, I'd prefer not to require it, but that's not
really a decent argument IMO :-)

[... I haven't read this in much detail yet, intend to do that later...]

-- 
 w@uter.{be,co.za}
wouter@{grep.be,fosdem.org,debian.org}



Re: [PATCH] spec: Add NBD_OPT_EXTENDED_HEADERS

2021-12-07 Thread Vladimir Sementsov-Ogievskiy

07.12.2021 02:00, Eric Blake wrote:

On Mon, Dec 06, 2021 at 02:40:45PM +0300, Vladimir Sementsov-Ogievskiy wrote:

    Simple reply message

   The simple reply message MUST be sent by the server in response to all
   requests if structured replies have not been negotiated using
-`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been negotiated, a 
simple
-reply MAY be used as a reply to any request other than `NBD_CMD_READ`,
-but only if the reply has no data payload.  The message looks as
-follows:
+`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been
+negotiated, a simple reply MAY be used as a reply to any request other
+than `NBD_CMD_READ`, but only if the reply has no data payload.  If
+extended headers were not negotiated using `NBD_OPT_EXTENDED_HEADERS`,
+the message looks as follows:

   S: 32 bits, 0x67446698, magic (`NBD_SIMPLE_REPLY_MAGIC`; used to be
  `NBD_REPLY_MAGIC`)
@@ -369,6 +398,16 @@ S: 64 bits, handle
   S: (*length* bytes of data if the request is of type `NBD_CMD_READ` and
   *error* is zero)

+If extended headers were negotiated using `NBD_OPT_EXTENDED_HEADERS`,
+the message looks like:
+
+S: 32 bits, 0x60d12fd6, magic (`NBD_SIMPLE_REPLY_EXT_MAGIC`)
+S: 32 bits, error (MAY be zero)
+S: 64 bits, handle
+S: 128 bits, padding (MUST be zero)
+S: (*length* bytes of data if the request is of type `NBD_CMD_READ` and
+*error* is zero)
+


If we go this way, let's put payload length into padding: it will help to make 
the protocol context-independent and less error-prone.


Easy enough to do (the payload length will be 0 except for
NBD_CMD_READ).



Or, the otherway, may be just forbid the payload for simple-64bit ? What's the 
reason to allow 64bit requests without structured reply negotiation?


The two happened to be orthogonal enough in my implementation.  It was
easy to demonstrate either one without the other, and it IS easier to
write a client using non-structured replies (structured reads ARE
tougher than simple reads, even if it is less efficient when it comes
to reading zeros).  But you are also right that we could require
structured reads prior to allowing 64-bit operations, and then have
only one supported reply type on the wire when negotiated.  Wouter,
which way do you prefer?




    Structured reply chunk message

   Some of the major downsides of the default simple reply to
@@ -410,7 +449,9 @@ considered successful only if it did not contain any error 
chunks,
   although the client MAY be able to determine partial success based
   on the chunks received.

-A structured reply chunk message looks as follows:
+If extended headers were not negotiated using
+`NBD_OPT_EXTENDED_HEADERS`, a structured reply chunk message looks as
+follows:

   S: 32 bits, 0x668e33ef, magic (`NBD_STRUCTURED_REPLY_MAGIC`)
   S: 16 bits, flags
@@ -423,6 +464,17 @@ The use of *length* in the reply allows context-free 
division of
   the overall server traffic into individual reply messages; the
   *type* field describes how to further interpret the payload.

+If extended headers were negotiated using `NBD_OPT_EXTENDED_HEADERS`,
+the message looks like:
+
+S: 32 bits, 0x6e8a278c, magic (`NBD_STRUCTURED_REPLY_EXT_MAGIC`)
+S: 16 bits, flags
+S: 16 bits, type
+S: 64 bits, handle
+S: 64 bits, length of payload (unsigned)


Maybe, 64bits is too much for payload. But who knows. And it's good that it's 
symmetric to 64bit length in request.


Indeed, both qemu and libnbd implementations explicitly kill the
connection to any server that replies with more than the max buffer
used for NBD_CMD_READ/WRITE (32M for qemu, 64M for libnbd).  And if
the spec is not already clear on the topic, I should add an
independent patch to NBD_CMD_BLOCK_STATUS to make it obvious that a
server cannot reply with too many extents because of such clients.

So none of my proof-of-concept code ever used the full 64-bits of the
reply header length.  On the other hand, there is indeed the symmetry
argument - if someone writes a server willing to accept a 4G
NBD_CMD_WRITE, then it should also support a 4G NBD_CMD_READ, even if
no known existing server or client allows buffers that large..




+S: 64 bits, padding (MUST be zero)


Hmm. Extra 8 bytes to be power-of-2. Does 32 bytes really perform better than 
24 bytes?


Consider:
struct header[100];

if sizeof(header[0]) is a power of 2 <= the cache line size (and the
compiler prefers to start arrays aligned to the cache line) then we
are guaranteed that all array members each reside in a single cache
line.  But if it is not a power of 2, some of the array members
straddle two cache lines.

Will there be code that wants to create an array of headers?  Perhaps
so, because that is a logical way (along with scatter/gather to
combine the header with variable-sized payloads) of tracking the
headers for multiple commands issued in parallel.

Do I have actual performance numbers?  No. But there's plenty of
google hits for why sizing structs to a power of 2 is a 

Re: [PATCH] spec: Add NBD_OPT_EXTENDED_HEADERS

2021-12-06 Thread Eric Blake
On Mon, Dec 06, 2021 at 02:40:45PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> >    Simple reply message
> > 
> >   The simple reply message MUST be sent by the server in response to all
> >   requests if structured replies have not been negotiated using
> > -`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been negotiated, a 
> > simple
> > -reply MAY be used as a reply to any request other than `NBD_CMD_READ`,
> > -but only if the reply has no data payload.  The message looks as
> > -follows:
> > +`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been
> > +negotiated, a simple reply MAY be used as a reply to any request other
> > +than `NBD_CMD_READ`, but only if the reply has no data payload.  If
> > +extended headers were not negotiated using `NBD_OPT_EXTENDED_HEADERS`,
> > +the message looks as follows:
> > 
> >   S: 32 bits, 0x67446698, magic (`NBD_SIMPLE_REPLY_MAGIC`; used to be
> >  `NBD_REPLY_MAGIC`)
> > @@ -369,6 +398,16 @@ S: 64 bits, handle
> >   S: (*length* bytes of data if the request is of type `NBD_CMD_READ` and
> >   *error* is zero)
> > 
> > +If extended headers were negotiated using `NBD_OPT_EXTENDED_HEADERS`,
> > +the message looks like:
> > +
> > +S: 32 bits, 0x60d12fd6, magic (`NBD_SIMPLE_REPLY_EXT_MAGIC`)
> > +S: 32 bits, error (MAY be zero)
> > +S: 64 bits, handle
> > +S: 128 bits, padding (MUST be zero)
> > +S: (*length* bytes of data if the request is of type `NBD_CMD_READ` and
> > +*error* is zero)
> > +
> 
> If we go this way, let's put payload length into padding: it will help to 
> make the protocol context-independent and less error-prone.

Easy enough to do (the payload length will be 0 except for
NBD_CMD_READ).

> 
> Or, the otherway, may be just forbid the payload for simple-64bit ? What's 
> the reason to allow 64bit requests without structured reply negotiation?

The two happened to be orthogonal enough in my implementation.  It was
easy to demonstrate either one without the other, and it IS easier to
write a client using non-structured replies (structured reads ARE
tougher than simple reads, even if it is less efficient when it comes
to reading zeros).  But you are also right that we could require
structured reads prior to allowing 64-bit operations, and then have
only one supported reply type on the wire when negotiated.  Wouter,
which way do you prefer?

> 
> >    Structured reply chunk message
> > 
> >   Some of the major downsides of the default simple reply to
> > @@ -410,7 +449,9 @@ considered successful only if it did not contain any 
> > error chunks,
> >   although the client MAY be able to determine partial success based
> >   on the chunks received.
> > 
> > -A structured reply chunk message looks as follows:
> > +If extended headers were not negotiated using
> > +`NBD_OPT_EXTENDED_HEADERS`, a structured reply chunk message looks as
> > +follows:
> > 
> >   S: 32 bits, 0x668e33ef, magic (`NBD_STRUCTURED_REPLY_MAGIC`)
> >   S: 16 bits, flags
> > @@ -423,6 +464,17 @@ The use of *length* in the reply allows context-free 
> > division of
> >   the overall server traffic into individual reply messages; the
> >   *type* field describes how to further interpret the payload.
> > 
> > +If extended headers were negotiated using `NBD_OPT_EXTENDED_HEADERS`,
> > +the message looks like:
> > +
> > +S: 32 bits, 0x6e8a278c, magic (`NBD_STRUCTURED_REPLY_EXT_MAGIC`)
> > +S: 16 bits, flags
> > +S: 16 bits, type
> > +S: 64 bits, handle
> > +S: 64 bits, length of payload (unsigned)
> 
> Maybe, 64bits is too much for payload. But who knows. And it's good that it's 
> symmetric to 64bit length in request.

Indeed, both qemu and libnbd implementations explicitly kill the
connection to any server that replies with more than the max buffer
used for NBD_CMD_READ/WRITE (32M for qemu, 64M for libnbd).  And if
the spec is not already clear on the topic, I should add an
independent patch to NBD_CMD_BLOCK_STATUS to make it obvious that a
server cannot reply with too many extents because of such clients.

So none of my proof-of-concept code ever used the full 64-bits of the
reply header length.  On the other hand, there is indeed the symmetry
argument - if someone writes a server willing to accept a 4G
NBD_CMD_WRITE, then it should also support a 4G NBD_CMD_READ, even if
no known existing server or client allows buffers that large..

> 
> > +S: 64 bits, padding (MUST be zero)
> 
> Hmm. Extra 8 bytes to be power-of-2. Does 32 bytes really perform better than 
> 24 bytes?

Consider:
struct header[100];

if sizeof(header[0]) is a power of 2 <= the cache line size (and the
compiler prefers to start arrays aligned to the cache line) then we
are guaranteed that all array members each reside in a single cache
line.  But if it is not a power of 2, some of the array members
straddle two cache lines.

Will there be code that wants to create an array of headers?  Perhaps
so, because that is a logical way (along with scatter/gather to
combine the header with 

Re: [PATCH] spec: Add NBD_OPT_EXTENDED_HEADERS

2021-12-06 Thread Vladimir Sementsov-Ogievskiy

04.12.2021 02:14, Eric Blake wrote:

Add a new negotiation feature where the client and server agree to use
larger packet headers on every packet sent during transmission phase.
This has two purposes: first, it makes it possible to perform
operations like trim, write zeroes, and block status on more than 2^32
bytes in a single command; this in turn requires that some structured
replies from the server also be extended to match.  The wording chosen
here is careful to permit a server to use either flavor in its reply
(that is, a request less than 32-bits can trigger an extended reply,
and conversely a request larger than 32-bits can trigger a compact
reply).

Second, when structured replies are active, clients have to deal with
the difference between 16- and 20-byte headers of simple
vs. structured replies, which impacts performance if the client must
perform multiple syscalls to first read the magic before knowing how
many additional bytes to read.  In extended header mode, all headers
are the same width, so the client can read a full header before
deciding whether the header describes a simple or structured reply.
Similarly, by having extended mode use a power-of-2 sizing, it becomes
easier to manipulate headers within a single cache line, even if it
requires padding bytes sent over the wire.  However, note that this
change only affects the headers; as data payloads can still be
unaligned (for example, a client performing 1-byte reads or writes),
we would need to negotiate yet another extension if we wanted to
ensure that all NBD transmission packets started on an 8-byte boundary
after option haggling has completed.

This spec addition was done in parallel with a proof of concept
implementation in qemu (server and client) and libnbd (client), and I
also have plans to implement it in nbdkit (server).

Signed-off-by: Eric Blake 
---

Available at https://repo.or.cz/nbd/ericb.git/shortlog/refs/tags/exthdr-v1

  doc/proto.md | 218 +--
  1 file changed, 177 insertions(+), 41 deletions(-)

diff --git a/doc/proto.md b/doc/proto.md
index 3a877a9..46560b6 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -295,6 +295,21 @@ reply is also problematic for error handling of the 
`NBD_CMD_READ`
  request.  Therefore, structured replies can be used to create a
  a context-free server stream; see below.

+The results of client negotiation also determine whether the client
+and server will utilize only compact requests and replies, or whether
+both sides will use only extended packets.  Compact messages are the
+default, but inherently limit single transactions to a 32-bit window
+starting at a 64-bit offset.  Extended messages make it possible to
+perform 64-bit transactions (although typically only for commands that
+do not include a data payload).  Furthermore, when structured replies
+have been negotiated, compact messages require the client to perform
+partial reads to determine which reply packet style (simple or
+structured) is on the wire before knowing the length of the rest of
+the reply, which can reduce client performance.  With extended
+messages, all packet headers have a fixed length of 32 bytes, and
+although this results in more traffic over the network due to padding,
+the resulting layout is friendlier for performance.
+
  Replies need not be sent in the same order as requests (i.e., requests
  may be handled by the server asynchronously), and structured reply
  chunks from one request may be interleaved with reply messages from
@@ -343,7 +358,9 @@ may be useful.

   Request message

-The request message, sent by the client, looks as follows:
+The compact request message, sent by the client when extended
+transactions are not negotiated using `NBD_OPT_EXTENDED_HEADERS`,
+looks as follows:

  C: 32 bits, 0x25609513, magic (`NBD_REQUEST_MAGIC`)
  C: 16 bits, command flags
@@ -353,14 +370,26 @@ C: 64 bits, offset (unsigned)
  C: 32 bits, length (unsigned)
  C: (*length* bytes of data if the request is of type `NBD_CMD_WRITE`)

+If negotiation agreed on extended transactions with
+`NBD_OPT_EXTENDED_HEADERS`, the client instead uses extended requests:
+
+C: 32 bits, 0x21e41c71, magic (`NBD_REQUEST_EXT_MAGIC`)
+C: 16 bits, command flags
+C: 16 bits, type
+C: 64 bits, handle
+C: 64 bits, offset (unsigned)
+C: 64 bits, length (unsigned)
+C: (*length* bytes of data if the request is of type `NBD_CMD_WRITE`)
+
   Simple reply message

  The simple reply message MUST be sent by the server in response to all
  requests if structured replies have not been negotiated using
-`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been negotiated, a 
simple
-reply MAY be used as a reply to any request other than `NBD_CMD_READ`,
-but only if the reply has no data payload.  The message looks as
-follows:
+`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been
+negotiated, a simple reply MAY be used as a reply to any request other
+than `NBD_CMD_READ`, but only if 

[PATCH] spec: Add NBD_OPT_EXTENDED_HEADERS

2021-12-03 Thread Eric Blake
Add a new negotiation feature where the client and server agree to use
larger packet headers on every packet sent during transmission phase.
This has two purposes: first, it makes it possible to perform
operations like trim, write zeroes, and block status on more than 2^32
bytes in a single command; this in turn requires that some structured
replies from the server also be extended to match.  The wording chosen
here is careful to permit a server to use either flavor in its reply
(that is, a request less than 32-bits can trigger an extended reply,
and conversely a request larger than 32-bits can trigger a compact
reply).

Second, when structured replies are active, clients have to deal with
the difference between 16- and 20-byte headers of simple
vs. structured replies, which impacts performance if the client must
perform multiple syscalls to first read the magic before knowing how
many additional bytes to read.  In extended header mode, all headers
are the same width, so the client can read a full header before
deciding whether the header describes a simple or structured reply.
Similarly, by having extended mode use a power-of-2 sizing, it becomes
easier to manipulate headers within a single cache line, even if it
requires padding bytes sent over the wire.  However, note that this
change only affects the headers; as data payloads can still be
unaligned (for example, a client performing 1-byte reads or writes),
we would need to negotiate yet another extension if we wanted to
ensure that all NBD transmission packets started on an 8-byte boundary
after option haggling has completed.

This spec addition was done in parallel with a proof of concept
implementation in qemu (server and client) and libnbd (client), and I
also have plans to implement it in nbdkit (server).

Signed-off-by: Eric Blake 
---

Available at https://repo.or.cz/nbd/ericb.git/shortlog/refs/tags/exthdr-v1

 doc/proto.md | 218 +--
 1 file changed, 177 insertions(+), 41 deletions(-)

diff --git a/doc/proto.md b/doc/proto.md
index 3a877a9..46560b6 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -295,6 +295,21 @@ reply is also problematic for error handling of the 
`NBD_CMD_READ`
 request.  Therefore, structured replies can be used to create a
 a context-free server stream; see below.

+The results of client negotiation also determine whether the client
+and server will utilize only compact requests and replies, or whether
+both sides will use only extended packets.  Compact messages are the
+default, but inherently limit single transactions to a 32-bit window
+starting at a 64-bit offset.  Extended messages make it possible to
+perform 64-bit transactions (although typically only for commands that
+do not include a data payload).  Furthermore, when structured replies
+have been negotiated, compact messages require the client to perform
+partial reads to determine which reply packet style (simple or
+structured) is on the wire before knowing the length of the rest of
+the reply, which can reduce client performance.  With extended
+messages, all packet headers have a fixed length of 32 bytes, and
+although this results in more traffic over the network due to padding,
+the resulting layout is friendlier for performance.
+
 Replies need not be sent in the same order as requests (i.e., requests
 may be handled by the server asynchronously), and structured reply
 chunks from one request may be interleaved with reply messages from
@@ -343,7 +358,9 @@ may be useful.

  Request message

-The request message, sent by the client, looks as follows:
+The compact request message, sent by the client when extended
+transactions are not negotiated using `NBD_OPT_EXTENDED_HEADERS`,
+looks as follows:

 C: 32 bits, 0x25609513, magic (`NBD_REQUEST_MAGIC`)  
 C: 16 bits, command flags  
@@ -353,14 +370,26 @@ C: 64 bits, offset (unsigned)
 C: 32 bits, length (unsigned)  
 C: (*length* bytes of data if the request is of type `NBD_CMD_WRITE`)  

+If negotiation agreed on extended transactions with
+`NBD_OPT_EXTENDED_HEADERS`, the client instead uses extended requests:
+
+C: 32 bits, 0x21e41c71, magic (`NBD_REQUEST_EXT_MAGIC`)  
+C: 16 bits, command flags  
+C: 16 bits, type  
+C: 64 bits, handle  
+C: 64 bits, offset (unsigned)  
+C: 64 bits, length (unsigned)  
+C: (*length* bytes of data if the request is of type `NBD_CMD_WRITE`)  
+
  Simple reply message

 The simple reply message MUST be sent by the server in response to all
 requests if structured replies have not been negotiated using
-`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been negotiated, a 
simple
-reply MAY be used as a reply to any request other than `NBD_CMD_READ`,
-but only if the reply has no data payload.  The message looks as
-follows:
+`NBD_OPT_STRUCTURED_REPLY`. If structured replies have been
+negotiated, a simple reply MAY be used as a reply to any request other
+than `NBD_CMD_READ`, but only if the reply has no data payload.