Re: [Nbd] [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension

2016-04-04 Thread Eric Blake
On 04/04/2016 04:18 AM, Markus Pargmann wrote:
> Hi,
> 
> back from my easter vacation. A bit surprised to find 200 mails in the
> NBD mailing list ;).
> 

>> Yes. This has been discussed on the nbd-general list in the past. There
>> is also the (significant) problem of the server having maybe already
>> sent out the header before discovering there is an error, at which point
>> it can *only* drop the connection.
> 
> I am still not through all the new mails on the list, so there may be
> some more discussions about this. But I will answer here simply.
> 
> I generally like the idea of having this new kind of reply. Is this
> solving our issues where we want to "stream" data directly from a
> filedescriptor into a tcp-stream?

I'm a relative newcomer on the NBD list, so I'm not sure what the issue
was there, but yes, structured replies DOES help with read semantics
that encounter a partial error that can be detected only after you have
started the reply stream.

> 
> Does it make sense to extend this for reads with an offset? This way we
> could not only send in chunks but also order them randomly. Is there any
> use-case where it does make sense to read data not sequentially?

In fact, that's what already got committed while you were gone.  It may
help if you jump in and read the current state of proto.md, if you don't
want to plow through all the mail churn in the meantime for how we got
to the state committed.  And of course, if you have suggestions on how
to further improve it, the extension is still documented as
experimental, so we can further tweak it before releasing upstream NBD
or downstream QEMU implementations of the extension (I'm currently
coding up the work on a qemu implementation, and will report on anything
that I run into during that process).

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature
--
___
Nbd-general mailing list
Nbd-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nbd-general


Re: [Nbd] [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension

2016-04-04 Thread Eric Blake
On 03/23/2016 08:16 AM, Denis V. Lunev wrote:
> From: Pavel Borzenkov 
> 
> With the availability of sparse storage formats, it is often needed to
> query status of a particular LBA range and read only those blocks of
> data that are actually present on the block device.
> 
> To provide such information, the patch adds GET_LBA_STATUS extension
> with one new NBD_CMD_GET_LBA_STATUS command.
> 
> There exists a concept of data dirtiness, which is required during, for
> example, incremental block device backup. To express this concept via
> NBD protocol, this patch also adds additional mode of operation to
> NBD_CMD_GET_LBA_STATUS command.
> 
> Since NBD protocol has no notion of block size, and to mimic SCSI "GET
> LBA STATUS" command more closely, it has been chosen to return a list of
> extents in the response of NBD_CMD_GET_LBA_STATUS command, instead of a
> bitmap.
> 
> Signed-off-by: Pavel Borzenkov 
> Reviewed-by: Roman Kagan 
> Signed-off-by: Denis V. Lunev 
> CC: Wouter Verhelst 
> CC: Paolo Bonzini 
> CC: Kevin Wolf 
> CC: Stefan Hajnoczi 
> ---
>  doc/proto.md | 82 
> 
>  1 file changed, 82 insertions(+)

I've posted a v2 of this proposal under a new title, rebased on top of
the recent work to add structured replies, and trying to take into
account a number of the suggestions that occurred in this thread.

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature
--
___
Nbd-general mailing list
Nbd-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nbd-general


Re: [Nbd] [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension

2016-04-04 Thread Markus Pargmann
Hi,

back from my easter vacation. A bit surprised to find 200 mails in the
NBD mailing list ;).

On Friday 25 March 2016 09:49:29 Wouter Verhelst wrote:
> On Thu, Mar 24, 2016 at 04:08:13PM -0600, Eric Blake wrote:
> > On 03/23/2016 08:16 AM, Denis V. Lunev wrote:
> > > From: Pavel Borzenkov 
> > > 
> > > With the availability of sparse storage formats, it is often needed to
> > > query status of a particular LBA range and read only those blocks of
> > > data that are actually present on the block device.
> > > 
> > > To provide such information, the patch adds GET_LBA_STATUS extension
> > > with one new NBD_CMD_GET_LBA_STATUS command.
> > > 
> > 
> > > +* `NBD_CMD_GET_LBA_STATUS` (7)
> > > +
> > > +An LBA range status query request. Length and offset define the range
> > > +of interest. The server MUST reply with a reply header, followed
> > > +immediately by the following data:
> > > +
> > > +  - 32 bits, length of parameter data that follow (unsigned)
> > > +  - zero or more LBA status descriptors, each having the following
> > > +structure:
> > > +
> > > +* 64 bits, offset (unsigned)
> > > +* 32 bits, length (unsigned)
> > > +* 16 bits, status (unsigned)
> > > +
> > > +unless an error condition has occurred.
> > 
> > To date, only the NBD_CMD_READ command caused the server to send data
> > after the handle in its reply.  This would be the second command with a
> > data field in the response, but it is returning a variable amount of
> > data, not directly tied to the client's length - but at least you did
> > make it structured so that the client knows how much to read.  However,
> > your patch is incomplete; you'll need to edit the "Transmission" section
> > of the document to mention the rules on the server sending data, as the
> > existing text would now be wrong:
> > 
> > > The server replies with:
> > > 
> > > S: 32 bits, 0x67446698, magic (NBD_REPLY_MAGIC)
> > > S: 32 bits, error
> > > S: 64 bits, handle
> > > S: (length bytes of data if the request is of type NBD_CMD_READ)
> > 
> > You may also want to add a rule that for all future extensions, any
> > command that requires data in the server response (other than the server
> > response to NBD_CMD_READ) must include a 32-bit length as the first
> > field of its data payload, so that the server reply is always structured.
> 
> Right.
> 
> > Hmm, I still think it would be hard to write a wireshark dissector for
> > server replies, since there is no indication whether a given reply will
> > include data or not.
> 
> Well, a wireshark dissector already exists. However, it is very old; it
> doesn't support the NBD_CMD_TRIM or NBD_CMD_FLUSH commands, it doesn't
> support the NBD_CMD_FLAG_FUA option to NBD_CMD_WRITE, and it doesn't
> deal with negotiation at all. It was written when newstyle negotiation
> didn't exist yet, and scanning for INIT_PASSWD at the time was too hard,
> according to the guy who wrote it (don't remember who that was).
> 
> > The _client_ knows (well, any well-written client
> > that uses a different value for 'handle' for every command sent to the
> > server), because it can read the returned 'handle' field to know what
> > command it matches to and therefore what data to expect; but a
> > third-party observer seeing _just_ the server messages has no idea which
> > server responses should have payload.
> 
> It can if it knows enough about the protocol, but granted, that makes it
> harder for us to extend the protocol without breaking the dissector.
> 
> > Scanning the stream and assuming that a new reply starts each time
> > NBD_REPLY_MAGIC is encountered is a close approximation, but hits
> > false positives if the data being returned for NBD_CMD_READ contains
> > the same magic number as part of its contents.
> 
> Indeed.
> 
> > Obviously, back-compat says we can't change the response to
> > NBD_CMD_READ, but that means that a wireshark dissector has to either
> > maintain context, or hunt through returned data looking for potential
> > magic numbers and possibly hitting false positives.
> >
> > That said, maybe we want to allow global flag negotiation prior to
> > transmission to add a new flag on both server and client side - the
> > server would advertise that it is capable of a new reply mode, and the
> > client then has to reply that it wants to use the new reply mode; in
> 
> Global flag negotiation is already possible. There is a client flags
> field, which is sent before option haggling, that could be used for
> negotiation of such backwards-incompatible features.
> 
> > that mode, all server replies starting with magic 0x67446698
> > (NBD_REPLY_MAGIC) will never have a data payload, and all commands that
> > cause a reply with payload (both NBD_CMD_READ and the new
> > NBD_CMD_GET_LBA_STATUS of this message, or whatever name we give it)
> > will reply with a NEW magic number:
> > 
> > S: 32 bits, , magic (NBD_REPLY_MAGIC2)
> > S: 

Re: [Nbd] [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension

2016-03-29 Thread Wouter Verhelst
On Tue, Mar 29, 2016 at 11:38:35AM +0200, Kevin Wolf wrote:
> Am 24.03.2016 um 17:47 hat Wouter Verhelst geschrieben:
> > On Thu, Mar 24, 2016 at 05:07:47PM +0100, Kevin Wolf wrote:
> > > Am 24.03.2016 um 17:04 hat Eric Blake geschrieben:
> > > > On 03/24/2016 09:53 AM, Wouter Verhelst wrote:
> > > > > On Thu, Mar 24, 2016 at 04:33:42PM +0100, Paolo Bonzini wrote:
> > > > >> On 24/03/2016 16:25, Eric Blake wrote:
> > > >  However, let's make these bits, so that
> > > > 
> > > >  NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block 
> > > >  device
> > > >  NBD_STATE_ZERO (0x2), LBA extent will read as zeroes
> > > > >>>
> > > > >>> Should we flip the sense and call this NBD_STATE_UNALLOCATED (0 
> > > > >>> means
> > > > >>> allocated, 1 means not present), so that an overall status of 0 is a
> > > > >>> safe default?
> > > > >>
> > > > >> Double negations are evil (and don't work the same in all 
> > > > >> languages), so
> > > > >> I think it's a worse option.
> > > > > 
> > > > > I agree that a bit which says "unallocated" is confusing in that 
> > > > > manner,
> > > > > but that just means we need a better name (one that doesn't contain
> > > > > "un-" or "not")
> > > > > 
> > > > > I like the idea of having zero be the "sensible" default, although I'm
> > > > > quite unable to come up with a better name myself.
> > > > 
> > > > NBD_STATE_TRIM, perhaps? (0 for present, 1 for trimmed or unallocated);
> > > > matches well that we have NBD_CMD_TRIM for requesting the creation of
> > > > such a state.
> > > 
> > > How about NBD_STATE_HOLE?
> > 
> > Both will work, although I like NBD_STATE_TRIM slightly better because
> > it indeed nicely references NBD_CMD_TRIM.
> 
> I just thought that "trim" sounds more like an action than a status, and
> while the reason for a hole to exist can be a previous TRIM command,
> another option is that it's an area in an image that just has never been
> written to. In that case "trim" would be a misnomer.

Point. It could be "TRIMMED" instead, I suppose.

> > However, I also think it should then be made clear that issuing
> > NBD_CMD_TRIM doesn't *require* that GET_BLOCK returns NBD_STATE_TRIM for
> > that region if the backend storage format dosn't support that, to avoid
> > confusion later on.
> 
> Good point. That might be another reason for not calling the status
> "trim".

Also a good point...

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
   people in the world who think they really understand all of its rules,
   and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12

--
Transform Data into Opportunity.
Accelerate data analysis in your applications with
Intel Data Analytics Acceleration Library.
Click to learn more.
http://pubads.g.doubleclick.net/gampad/clk?id=278785471=/4140
___
Nbd-general mailing list
Nbd-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nbd-general


Re: [Nbd] [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension

2016-03-29 Thread Kevin Wolf
Am 24.03.2016 um 17:47 hat Wouter Verhelst geschrieben:
> On Thu, Mar 24, 2016 at 05:07:47PM +0100, Kevin Wolf wrote:
> > Am 24.03.2016 um 17:04 hat Eric Blake geschrieben:
> > > On 03/24/2016 09:53 AM, Wouter Verhelst wrote:
> > > > On Thu, Mar 24, 2016 at 04:33:42PM +0100, Paolo Bonzini wrote:
> > > >> On 24/03/2016 16:25, Eric Blake wrote:
> > >  However, let's make these bits, so that
> > > 
> > >  NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device
> > >  NBD_STATE_ZERO (0x2), LBA extent will read as zeroes
> > > >>>
> > > >>> Should we flip the sense and call this NBD_STATE_UNALLOCATED (0 means
> > > >>> allocated, 1 means not present), so that an overall status of 0 is a
> > > >>> safe default?
> > > >>
> > > >> Double negations are evil (and don't work the same in all languages), 
> > > >> so
> > > >> I think it's a worse option.
> > > > 
> > > > I agree that a bit which says "unallocated" is confusing in that manner,
> > > > but that just means we need a better name (one that doesn't contain
> > > > "un-" or "not")
> > > > 
> > > > I like the idea of having zero be the "sensible" default, although I'm
> > > > quite unable to come up with a better name myself.
> > > 
> > > NBD_STATE_TRIM, perhaps? (0 for present, 1 for trimmed or unallocated);
> > > matches well that we have NBD_CMD_TRIM for requesting the creation of
> > > such a state.
> > 
> > How about NBD_STATE_HOLE?
> 
> Both will work, although I like NBD_STATE_TRIM slightly better because
> it indeed nicely references NBD_CMD_TRIM.

I just thought that "trim" sounds more like an action than a status, and
while the reason for a hole to exist can be a previous TRIM command,
another option is that it's an area in an image that just has never been
written to. In that case "trim" would be a misnomer.

> However, I also think it should then be made clear that issuing
> NBD_CMD_TRIM doesn't *require* that GET_BLOCK returns NBD_STATE_TRIM for
> that region if the backend storage format dosn't support that, to avoid
> confusion later on.

Good point. That might be another reason for not calling the status
"trim".

Kevin


pgprRNVPKeakT.pgp
Description: PGP signature
--
Transform Data into Opportunity.
Accelerate data analysis in your applications with
Intel Data Analytics Acceleration Library.
Click to learn more.
http://pubads.g.doubleclick.net/gampad/clk?id=278785471=/4140___
Nbd-general mailing list
Nbd-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nbd-general


Re: [Nbd] [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension

2016-03-25 Thread Alex Bligh

On 25 Mar 2016, at 08:49, Wouter Verhelst  wrote:

> Yes. This has been discussed on the nbd-general list in the past. There
> is also the (significant) problem of the server having maybe already
> sent out the header before discovering there is an error, at which point
> it can *only* drop the connection.

Indeed.

I think where we got to last time this was discussed was that the
server could have the option of returning 'chunks' of reply, each
chunk either being a {length, data} tuple, or an error. The total
data transmitted would add up to the full length if there is no
error.

>> I could write up a negotiation of global flags for structured reply
>> lengths as an extension proposal, if you think it is worth it.
> 
> I think it is worth it...

+1 - for NBD_CMD_READ too

-- 
Alex Bligh





--
Transform Data into Opportunity.
Accelerate data analysis in your applications with
Intel Data Analytics Acceleration Library.
Click to learn more.
http://pubads.g.doubleclick.net/gampad/clk?id=278785351=/4140
___
Nbd-general mailing list
Nbd-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nbd-general


Re: [Nbd] [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension

2016-03-25 Thread Wouter Verhelst
On Thu, Mar 24, 2016 at 04:08:13PM -0600, Eric Blake wrote:
> On 03/23/2016 08:16 AM, Denis V. Lunev wrote:
> > From: Pavel Borzenkov 
> > 
> > With the availability of sparse storage formats, it is often needed to
> > query status of a particular LBA range and read only those blocks of
> > data that are actually present on the block device.
> > 
> > To provide such information, the patch adds GET_LBA_STATUS extension
> > with one new NBD_CMD_GET_LBA_STATUS command.
> > 
> 
> > +* `NBD_CMD_GET_LBA_STATUS` (7)
> > +
> > +An LBA range status query request. Length and offset define the range
> > +of interest. The server MUST reply with a reply header, followed
> > +immediately by the following data:
> > +
> > +  - 32 bits, length of parameter data that follow (unsigned)
> > +  - zero or more LBA status descriptors, each having the following
> > +structure:
> > +
> > +* 64 bits, offset (unsigned)
> > +* 32 bits, length (unsigned)
> > +* 16 bits, status (unsigned)
> > +
> > +unless an error condition has occurred.
> 
> To date, only the NBD_CMD_READ command caused the server to send data
> after the handle in its reply.  This would be the second command with a
> data field in the response, but it is returning a variable amount of
> data, not directly tied to the client's length - but at least you did
> make it structured so that the client knows how much to read.  However,
> your patch is incomplete; you'll need to edit the "Transmission" section
> of the document to mention the rules on the server sending data, as the
> existing text would now be wrong:
> 
> > The server replies with:
> > 
> > S: 32 bits, 0x67446698, magic (NBD_REPLY_MAGIC)
> > S: 32 bits, error
> > S: 64 bits, handle
> > S: (length bytes of data if the request is of type NBD_CMD_READ)
> 
> You may also want to add a rule that for all future extensions, any
> command that requires data in the server response (other than the server
> response to NBD_CMD_READ) must include a 32-bit length as the first
> field of its data payload, so that the server reply is always structured.

Right.

> Hmm, I still think it would be hard to write a wireshark dissector for
> server replies, since there is no indication whether a given reply will
> include data or not.

Well, a wireshark dissector already exists. However, it is very old; it
doesn't support the NBD_CMD_TRIM or NBD_CMD_FLUSH commands, it doesn't
support the NBD_CMD_FLAG_FUA option to NBD_CMD_WRITE, and it doesn't
deal with negotiation at all. It was written when newstyle negotiation
didn't exist yet, and scanning for INIT_PASSWD at the time was too hard,
according to the guy who wrote it (don't remember who that was).

> The _client_ knows (well, any well-written client
> that uses a different value for 'handle' for every command sent to the
> server), because it can read the returned 'handle' field to know what
> command it matches to and therefore what data to expect; but a
> third-party observer seeing _just_ the server messages has no idea which
> server responses should have payload.

It can if it knows enough about the protocol, but granted, that makes it
harder for us to extend the protocol without breaking the dissector.

> Scanning the stream and assuming that a new reply starts each time
> NBD_REPLY_MAGIC is encountered is a close approximation, but hits
> false positives if the data being returned for NBD_CMD_READ contains
> the same magic number as part of its contents.

Indeed.

> Obviously, back-compat says we can't change the response to
> NBD_CMD_READ, but that means that a wireshark dissector has to either
> maintain context, or hunt through returned data looking for potential
> magic numbers and possibly hitting false positives.
>
> That said, maybe we want to allow global flag negotiation prior to
> transmission to add a new flag on both server and client side - the
> server would advertise that it is capable of a new reply mode, and the
> client then has to reply that it wants to use the new reply mode; in

Global flag negotiation is already possible. There is a client flags
field, which is sent before option haggling, that could be used for
negotiation of such backwards-incompatible features.

> that mode, all server replies starting with magic 0x67446698
> (NBD_REPLY_MAGIC) will never have a data payload, and all commands that
> cause a reply with payload (both NBD_CMD_READ and the new
> NBD_CMD_GET_LBA_STATUS of this message, or whatever name we give it)
> will reply with a NEW magic number:
> 
> S: 32 bits, , magic (NBD_REPLY_MAGIC2)
> S: 32 bits, error
> S: 64 bits, handle
> S: 32 bits, length
> S: length bytes of data

I like this. However, before committing to it, I would like to see
Markus' feedback on that (explicit Cc added, even though he's on the
list, too).

We'd also need a way to communicate the ability to speak this protocol
from the kernel to userspace before telling the 

Re: [Nbd] [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension

2016-03-24 Thread Eric Blake
On 03/23/2016 08:16 AM, Denis V. Lunev wrote:
> From: Pavel Borzenkov 
> 
> With the availability of sparse storage formats, it is often needed to
> query status of a particular LBA range and read only those blocks of
> data that are actually present on the block device.
> 
> To provide such information, the patch adds GET_LBA_STATUS extension
> with one new NBD_CMD_GET_LBA_STATUS command.
> 

> +* `NBD_CMD_GET_LBA_STATUS` (7)
> +
> +An LBA range status query request. Length and offset define the range
> +of interest. The server MUST reply with a reply header, followed
> +immediately by the following data:
> +
> +  - 32 bits, length of parameter data that follow (unsigned)
> +  - zero or more LBA status descriptors, each having the following
> +structure:
> +
> +* 64 bits, offset (unsigned)
> +* 32 bits, length (unsigned)
> +* 16 bits, status (unsigned)
> +
> +unless an error condition has occurred.

To date, only the NBD_CMD_READ command caused the server to send data
after the handle in its reply.  This would be the second command with a
data field in the response, but it is returning a variable amount of
data, not directly tied to the client's length - but at least you did
make it structured so that the client knows how much to read.  However,
your patch is incomplete; you'll need to edit the "Transmission" section
of the document to mention the rules on the server sending data, as the
existing text would now be wrong:

> The server replies with:
> 
> S: 32 bits, 0x67446698, magic (NBD_REPLY_MAGIC)
> S: 32 bits, error
> S: 64 bits, handle
> S: (length bytes of data if the request is of type NBD_CMD_READ)

You may also want to add a rule that for all future extensions, any
command that requires data in the server response (other than the server
response to NBD_CMD_READ) must include a 32-bit length as the first
field of its data payload, so that the server reply is always structured.

Hmm, I still think it would be hard to write a wireshark dissector for
server replies, since there is no indication whether a given reply will
include data or not.  The _client_ knows (well, any well-written client
that uses a different value for 'handle' for every command sent to the
server), because it can read the returned 'handle' field to know what
command it matches to and therefore what data to expect; but a
third-party observer seeing _just_ the server messages has no idea which
server responses should have payload.  Scanning the stream and assuming
that a new reply starts each time NBD_REPLY_MAGIC is encountered is a
close approximation, but hits false positives if the data being returned
for NBD_CMD_READ contains the same magic number as part of its contents.
 Obviously, back-compat says we can't change the response to
NBD_CMD_READ, but that means that a wireshark dissector has to either
maintain context, or hunt through returned data looking for potential
magic numbers and possibly hitting false positives.

That said, maybe we want to allow global flag negotiation prior to
transmission to add a new flag on both server and client side - the
server would advertise that it is capable of a new reply mode, and the
client then has to reply that it wants to use the new reply mode; in
that mode, all server replies starting with magic 0x67446698
(NBD_REPLY_MAGIC) will never have a data payload, and all commands that
cause a reply with payload (both NBD_CMD_READ and the new
NBD_CMD_GET_LBA_STATUS of this message, or whatever name we give it)
will reply with a NEW magic number:

S: 32 bits, , magic (NBD_REPLY_MAGIC2)
S: 32 bits, error
S: 64 bits, handle
S: 32 bits, length
S: length bytes of data

so that the server's data stream is fully structured without having to
maintain context of the client's requests.  That is, a dissector can now
merely scan for both magic numbers; and on a stream between a client and
server that have negotiated the new mode, the old magic number will
never have a payload, and the new magic number will always be
accompanied with a payload that describes how much data to read to the
boundary of the next reply.

For that matter, right now, NBD_CMD_READ is required to either end the
session or return length bytes of data even when error is non-zero (but
where those bytes MAY be invalid); but by adding a negotiated flag for
structured length replies, it would be possible to allow for short reads
and/or returning an error with 0 bytes of payload but still keeping the
connection to the client open, without having to send wasted bytes over
the wire.

I could write up a negotiation of global flags for structured reply
lengths as an extension proposal, if you think it is worth it.

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature
--

Re: [Nbd] [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension

2016-03-24 Thread Wouter Verhelst
On Thu, Mar 24, 2016 at 05:07:47PM +0100, Kevin Wolf wrote:
> Am 24.03.2016 um 17:04 hat Eric Blake geschrieben:
> > On 03/24/2016 09:53 AM, Wouter Verhelst wrote:
> > > On Thu, Mar 24, 2016 at 04:33:42PM +0100, Paolo Bonzini wrote:
> > >> On 24/03/2016 16:25, Eric Blake wrote:
> >  However, let's make these bits, so that
> > 
> >  NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device
> >  NBD_STATE_ZERO (0x2), LBA extent will read as zeroes
> > >>>
> > >>> Should we flip the sense and call this NBD_STATE_UNALLOCATED (0 means
> > >>> allocated, 1 means not present), so that an overall status of 0 is a
> > >>> safe default?
> > >>
> > >> Double negations are evil (and don't work the same in all languages), so
> > >> I think it's a worse option.
> > > 
> > > I agree that a bit which says "unallocated" is confusing in that manner,
> > > but that just means we need a better name (one that doesn't contain
> > > "un-" or "not")
> > > 
> > > I like the idea of having zero be the "sensible" default, although I'm
> > > quite unable to come up with a better name myself.
> > 
> > NBD_STATE_TRIM, perhaps? (0 for present, 1 for trimmed or unallocated);
> > matches well that we have NBD_CMD_TRIM for requesting the creation of
> > such a state.
> 
> How about NBD_STATE_HOLE?

Both will work, although I like NBD_STATE_TRIM slightly better because
it indeed nicely references NBD_CMD_TRIM.

However, I also think it should then be made clear that issuing
NBD_CMD_TRIM doesn't *require* that GET_BLOCK returns NBD_STATE_TRIM for
that region if the backend storage format dosn't support that, to avoid
confusion later on.

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
   people in the world who think they really understand all of its rules,
   and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12


signature.asc
Description: PGP signature
--
Transform Data into Opportunity.
Accelerate data analysis in your applications with
Intel Data Analytics Acceleration Library.
Click to learn more.
http://pubads.g.doubleclick.net/gampad/clk?id=278785351=/4140___
Nbd-general mailing list
Nbd-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nbd-general


Re: [Nbd] [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension

2016-03-24 Thread Pavel Borzenkov
On Thu, Mar 24, 2016 at 09:04:07AM -0600, Eric Blake wrote:
> On 03/24/2016 06:30 AM, Pavel Borzenkov wrote:
> >> Conversely, it would be possible to send less data over the wire, as
> >> long as we require that all LBA status descriptors cover consecutive
> >> offsets.  That is, having the server reply with offsets is pointless,
> >> since they can be reconstructed on the client by starting with the
> >> offset in the client's request, then adding length from each status
> >> field.  Is less network traffic desirable?
> > 
> > I think it's better to explicitly send the start of each LBA extent.
> 
> Why? Is the redundancy for something that the client can reconstruct
> worth the extra safety at the cost of more traffic?

On second thought, not sending offsets look better. Firstly, they are
really not required for client to process the reply. Secondly, it also
implies that the regions are distinct and in sorted order and makes it
impossible to do otherwise.

-- 
Pavel

> 
> > So I'll go with changing 'status' field to 32 bits to avoid
> > packing/unpacking issues.
> 
> You may want to do that even if you eliminate the offset field, so that
> you have 8 bytes per struct (instead of 16).
> 
> >>
> >> Do we want to require that the server MUST reply with enough extents to
> >> sum up to the length of the client's request, or are we permitting a
> >> short reply?
> > 
> > While the "GET LBA STATUS" command in SCSI allows partial reply, I
> > believe it'd better to keep things simple and require that the server
> > must either return a list of extents that covers the whole requested
> > range, or an error.
> 
> Make sure you specify whether ranges are allowed to overlap, or must be
> distinct (I prefer the latter), and whether they must appear in sorted
> order (which I also prefer) - once you mandate ordered and
> non-overlapping coverage, client-side validation that the server's
> answer makes sense is easier (remember, we want the client to detect
> man-in-the-middle corruption of the server's reply).
> 
> 
> > Actually, for this command I treat 'command flags' field not as a set of
> > flags, but rather as a plain number representing required mode of
> > operation. Probably, not a good idea as it doesn't match the rest of the
> > commands.
> > 
> > I went this way because I didn't want to allow clients to request
> > several modes simultaneously (e.g. provisioning + dirtiness) in the same
> > request. This makes server side implementation harder.
> > 
> > I think I'll just switch to bits to match the rest of the commands and
> > will add a note, that server should return EINVAL in case several modes
> > are requested simultaneously.
> 
> But you don't need two bits.  Just a single bit will do (off for
> provisioning mode, on for dirty mode).  So there are no conflicting
> modes that can be simultaneously requested, at least in the current
> definition of a single valid flag bit.  (Then again, I did make a
> suggestion about additional bits, useful only during provisioning mode,
> that might be used to state that the client is okay if the server
> coalesces extents that differ only in allocation or only in zeroed
> content - if we add that bit or two bits, it would be an error to use it
> while requesting dirty mode).
> 
> 
> >> then we can express four states:
> >>
> >> 0x0 - LBA extent not present, client MUST NOT make assumptions about
> >> contents, and reads should not be attempted
> >> 0x1 - LBA extent allocated, reads will succeed but no guarantee on contents
> >> 0x2 - LBA extent not present, but client can treat the extent as zeroes
> >> and reads will succeed
> >> 0x3 - LBA extent present, client can treat the extent as zeroes and
> >> reads will succeed
> > 
> > I'm not sure that clients need this level of details. From client's POV
> > 0x2 and 0x3 are the same.
> 
> No, if the client is trying to EXACTLY copy sparseness, then 0x2 and 0x3
> differ on whether the client will punch a hole vs. explicitly allocate
> zeroes.
> 
> -- 
> Eric Blake   eblake redhat com+1-919-301-3266
> Libvirt virtualization library http://libvirt.org
> 



--
Transform Data into Opportunity.
Accelerate data analysis in your applications with
Intel Data Analytics Acceleration Library.
Click to learn more.
http://pubads.g.doubleclick.net/gampad/clk?id=278785351=/4140
___
Nbd-general mailing list
Nbd-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nbd-general


Re: [Nbd] [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension

2016-03-24 Thread Kevin Wolf
Am 24.03.2016 um 17:04 hat Eric Blake geschrieben:
> On 03/24/2016 09:53 AM, Wouter Verhelst wrote:
> > On Thu, Mar 24, 2016 at 04:33:42PM +0100, Paolo Bonzini wrote:
> >> On 24/03/2016 16:25, Eric Blake wrote:
>  However, let's make these bits, so that
> 
>  NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device
>  NBD_STATE_ZERO (0x2), LBA extent will read as zeroes
> >>>
> >>> Should we flip the sense and call this NBD_STATE_UNALLOCATED (0 means
> >>> allocated, 1 means not present), so that an overall status of 0 is a
> >>> safe default?
> >>
> >> Double negations are evil (and don't work the same in all languages), so
> >> I think it's a worse option.
> > 
> > I agree that a bit which says "unallocated" is confusing in that manner,
> > but that just means we need a better name (one that doesn't contain
> > "un-" or "not")
> > 
> > I like the idea of having zero be the "sensible" default, although I'm
> > quite unable to come up with a better name myself.
> 
> NBD_STATE_TRIM, perhaps? (0 for present, 1 for trimmed or unallocated);
> matches well that we have NBD_CMD_TRIM for requesting the creation of
> such a state.

How about NBD_STATE_HOLE?

Kevin


pgpmh71rMTY2T.pgp
Description: PGP signature
--
Transform Data into Opportunity.
Accelerate data analysis in your applications with
Intel Data Analytics Acceleration Library.
Click to learn more.
http://pubads.g.doubleclick.net/gampad/clk?id=278785351=/4140___
Nbd-general mailing list
Nbd-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nbd-general


Re: [Nbd] [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension

2016-03-24 Thread Wouter Verhelst
On Thu, Mar 24, 2016 at 04:33:42PM +0100, Paolo Bonzini wrote:
> On 24/03/2016 16:25, Eric Blake wrote:
> >> However, let's make these bits, so that
> >>
> >> NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device
> >> NBD_STATE_ZERO (0x2), LBA extent will read as zeroes
> > 
> > Should we flip the sense and call this NBD_STATE_UNALLOCATED (0 means
> > allocated, 1 means not present), so that an overall status of 0 is a
> > safe default?
> 
> Double negations are evil (and don't work the same in all languages), so
> I think it's a worse option.

I agree that a bit which says "unallocated" is confusing in that manner,
but that just means we need a better name (one that doesn't contain
"un-" or "not")

I like the idea of having zero be the "sensible" default, although I'm
quite unable to come up with a better name myself.

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
   people in the world who think they really understand all of its rules,
   and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12


signature.asc
Description: PGP signature
--
Transform Data into Opportunity.
Accelerate data analysis in your applications with
Intel Data Analytics Acceleration Library.
Click to learn more.
http://pubads.g.doubleclick.net/gampad/clk?id=278785351=/4140___
Nbd-general mailing list
Nbd-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nbd-general


Re: [Nbd] [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension

2016-03-24 Thread Paolo Bonzini
On 24/03/2016 16:25, Eric Blake wrote:
>> However, let's make these bits, so that
>>
>> NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device
>> NBD_STATE_ZERO (0x2), LBA extent will read as zeroes
> 
> Should we flip the sense and call this NBD_STATE_UNALLOCATED (0 means
> allocated, 1 means not present), so that an overall status of 0 is a
> safe default?

Double negations are evil (and don't work the same in all languages), so
I think it's a worse option.

>> An implementation that doesn't track the "dirtiness" state of blocks
>> MUST either fail this command with EINVAL, or mark all blocks as
>> dirty in the descriptor that it returns.
> 
> Is it feasible to return zero/allocated/dirty status all at the same
> time, or do we want to strictly require two different modes of
> operation?

I think we should differentiate them, because it makes sense to support
only one.

In particular, while it is more or less obvious that (in my proposal
above) a trivial implementation must return NBD_STATE_ALLOCATED, it is
quite weird to require a trivial implementation to return
NBD_STATE_ALLOCATED|NBD_STATE_DIRTY.

Paolo

> That is, if we are returning zero and allocated as two bits,
> can we also return a third bit for dirty/clean?  Should we flip the
> sense of the bit, where 0 means dirty and 1 means clean, again so that a
> server can always return a status of 0 as the safe default?
> 



signature.asc
Description: OpenPGP digital signature
--
Transform Data into Opportunity.
Accelerate data analysis in your applications with
Intel Data Analytics Acceleration Library.
Click to learn more.
http://pubads.g.doubleclick.net/gampad/clk?id=278785351=/4140___
Nbd-general mailing list
Nbd-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nbd-general


Re: [Nbd] [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension

2016-03-24 Thread Eric Blake
On 03/24/2016 06:30 AM, Pavel Borzenkov wrote:
>> Conversely, it would be possible to send less data over the wire, as
>> long as we require that all LBA status descriptors cover consecutive
>> offsets.  That is, having the server reply with offsets is pointless,
>> since they can be reconstructed on the client by starting with the
>> offset in the client's request, then adding length from each status
>> field.  Is less network traffic desirable?
> 
> I think it's better to explicitly send the start of each LBA extent.

Why? Is the redundancy for something that the client can reconstruct
worth the extra safety at the cost of more traffic?

> So I'll go with changing 'status' field to 32 bits to avoid
> packing/unpacking issues.

You may want to do that even if you eliminate the offset field, so that
you have 8 bytes per struct (instead of 16).

>>
>> Do we want to require that the server MUST reply with enough extents to
>> sum up to the length of the client's request, or are we permitting a
>> short reply?
> 
> While the "GET LBA STATUS" command in SCSI allows partial reply, I
> believe it'd better to keep things simple and require that the server
> must either return a list of extents that covers the whole requested
> range, or an error.

Make sure you specify whether ranges are allowed to overlap, or must be
distinct (I prefer the latter), and whether they must appear in sorted
order (which I also prefer) - once you mandate ordered and
non-overlapping coverage, client-side validation that the server's
answer makes sense is easier (remember, we want the client to detect
man-in-the-middle corruption of the server's reply).


> Actually, for this command I treat 'command flags' field not as a set of
> flags, but rather as a plain number representing required mode of
> operation. Probably, not a good idea as it doesn't match the rest of the
> commands.
> 
> I went this way because I didn't want to allow clients to request
> several modes simultaneously (e.g. provisioning + dirtiness) in the same
> request. This makes server side implementation harder.
> 
> I think I'll just switch to bits to match the rest of the commands and
> will add a note, that server should return EINVAL in case several modes
> are requested simultaneously.

But you don't need two bits.  Just a single bit will do (off for
provisioning mode, on for dirty mode).  So there are no conflicting
modes that can be simultaneously requested, at least in the current
definition of a single valid flag bit.  (Then again, I did make a
suggestion about additional bits, useful only during provisioning mode,
that might be used to state that the client is okay if the server
coalesces extents that differ only in allocation or only in zeroed
content - if we add that bit or two bits, it would be an error to use it
while requesting dirty mode).


>> then we can express four states:
>>
>> 0x0 - LBA extent not present, client MUST NOT make assumptions about
>> contents, and reads should not be attempted
>> 0x1 - LBA extent allocated, reads will succeed but no guarantee on contents
>> 0x2 - LBA extent not present, but client can treat the extent as zeroes
>> and reads will succeed
>> 0x3 - LBA extent present, client can treat the extent as zeroes and
>> reads will succeed
> 
> I'm not sure that clients need this level of details. From client's POV
> 0x2 and 0x3 are the same.

No, if the client is trying to EXACTLY copy sparseness, then 0x2 and 0x3
differ on whether the client will punch a hole vs. explicitly allocate
zeroes.

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature
--
Transform Data into Opportunity.
Accelerate data analysis in your applications with
Intel Data Analytics Acceleration Library.
Click to learn more.
http://pubads.g.doubleclick.net/gampad/clk?id=278785351=/4140___
Nbd-general mailing list
Nbd-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nbd-general


Re: [Nbd] [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension

2016-03-24 Thread Eric Blake
On 03/24/2016 05:55 AM, Paolo Bonzini wrote:
>> As Eric noted, please expand LBA at least once.
> 
> Let's just use "block" (e.g. NBD_CMD_GET_BLOCK_STATUS).

Yes, avoiding the term LBA and using BLOCK everywhere also nicely solves
the problem of introducing yet more terminology.

> 
>>> +  - 32 bits, length of parameter data that follow (unsigned)
>>> +  - zero or more LBA status descriptors, each having the following
>>> +structure:
>>> +
>>> +* 64 bits, offset (unsigned)
>>> +* 32 bits, length (unsigned)
>>> +* 16 bits, status (unsigned)
>>> +
>>> +unless an error condition has occurred.
>>> +
> 
> Can we just return one descriptor?  That would simplify the protocol a bit.

As in, the return is exactly one descriptor, consisting of:

* 32 bits, length (unsigned): must be > 0, <= the client's length
* 16 bits, status (unsigned): status of that block

Of course, it means more traffic. The nice part about returning an array
of descriptors is that I can learn the status of 1G of the file, even if
the file alternates every 512 bytes between extent status, in just one
client call. But returning only a single descriptor at a time means I'd
have to make 2M client calls to learn the same pattern of allocation.
Fortunately, in the common case, allocation patterns tend to not be that
disjoint.

On the other hand, returning only one descriptor at a time (for possibly
less length than the client requested) may be easier when using
lseek(SEEK_DATA/HOLE) as the mechanism for determining the bounds of
each extent, since the server only has to search once per command,
instead of dynamically construct the entire reply.

I don't have any strong opinions on which would be better, but it is
definitely food for thought.

> 
> However, let's make these bits, so that
> 
> NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device
> NBD_STATE_ZERO (0x2), LBA extent will read as zeroes

Should we flip the sense and call this NBD_STATE_UNALLOCATED (0 means
allocated, 1 means not present), so that an overall status of 0 is a
safe default?  (That is, it should always be safe to state a sector is
allocated when it is not, and always safe to state a sector is not known
to read as zeroes even if that happens to be its contents - all that we
lose by reporting this safe default state is that the client will be
unable to optimize for skipping holes).

>> Either the spec should define what it means for a block to be in a dirty
>> state, or it should not talk about it.
> 
> Here is my attempt:
> 
> This command is meant to operate in tandem with other (non-NBD)
> channels to the server.  Generally, a "dirty" block is a block that
> has been written to by someone, but the exact meaning of "has been
> written" is left to the implementation.  For example, a virtual
> machine monitor could provide a (non-NBD) command to start tracking
> blocks written by the virtual machine.  A backup client then can
> connect to an NBD server provided by the virtual machine monitor
> and use NBD_CMD_GET_BLOCK_STATUS only read blocks that the virtual

s/only/to only/

> machine has changed.

s/changed/changed since it started tracking/

> 
> An implementation that doesn't track the "dirtiness" state of blocks
> MUST either fail this command with EINVAL, or mark all blocks as
> dirty in the descriptor that it returns.

Is it feasible to return zero/allocated/dirty status all at the same
time, or do we want to strictly require two different modes of
operation?  That is, if we are returning zero and allocated as two bits,
can we also return a third bit for dirty/clean?  Should we flip the
sense of the bit, where 0 means dirty and 1 means clean, again so that a
server can always return a status of 0 as the safe default?

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature
--
Transform Data into Opportunity.
Accelerate data analysis in your applications with
Intel Data Analytics Acceleration Library.
Click to learn more.
http://pubads.g.doubleclick.net/gampad/clk?id=278785351=/4140___
Nbd-general mailing list
Nbd-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nbd-general


Re: [Nbd] [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension

2016-03-23 Thread Eric Blake
On 03/23/2016 08:16 AM, Denis V. Lunev wrote:
> From: Pavel Borzenkov 
> 
> With the availability of sparse storage formats, it is often needed to
> query status of a particular LBA range and read only those blocks of
> data that are actually present on the block device.

The acronym LBA is not used elsewhere in the NBD spec; should we spell
it out at least once?

> 
> To provide such information, the patch adds GET_LBA_STATUS extension
> with one new NBD_CMD_GET_LBA_STATUS command.
> 
> There exists a concept of data dirtiness, which is required during, for
> example, incremental block device backup. To express this concept via
> NBD protocol, this patch also adds additional mode of operation to
> NBD_CMD_GET_LBA_STATUS command.
> 
> Since NBD protocol has no notion of block size, and to mimic SCSI "GET
> LBA STATUS" command more closely, it has been chosen to return a list of
> extents in the response of NBD_CMD_GET_LBA_STATUS command, instead of a
> bitmap.
> 
> Signed-off-by: Pavel Borzenkov 
> Reviewed-by: Roman Kagan 
> Signed-off-by: Denis V. Lunev 
> CC: Wouter Verhelst 
> CC: Paolo Bonzini 
> CC: Kevin Wolf 
> CC: Stefan Hajnoczi 
> ---
>  doc/proto.md | 82 
> 
>  1 file changed, 82 insertions(+)
> 

>  
> +### `GET_LBA_STATUS` extension
> +
> +With the availability of sparse storage formats, it is often needed to query
> +status of a particular LBA range and read only those blocks of data that are
> +actually present on the block device.
> +
> +Some storage formats and operations over such formats express a concept of
> +data dirtiness. Whether the operation is block device mirroring,
> +incremental block device backup or any other operation with a concept of
> +data dirtiness, they all share a need to provide a list of LBA ranges
> +that this particular operation treats as dirty.
> +
> +To provide such class of information, `GET_LBA_STATUS` extension adds new
> +`NBD_CMD_GET_LBA_STATUS` command which returns a list of LBA ranges with
> +their respective states.
> +
> +* `NBD_CMD_GET_LBA_STATUS` (7)
> +
> +An LBA range status query request. Length and offset define the range
> +of interest. The server MUST reply with a reply header, followed
> +immediately by the following data:
> +
> +  - 32 bits, length of parameter data that follow (unsigned)

Is this length the number of descriptors, or the number of bytes
occupied by those descriptors?  It looks like bytes (that is, with the
current layout, this field should be a multiple of 14 unless an error is
returned and the data is bogus).

I guess 32 bits is sufficient: transmission commands are limited to
32-bit length, and we are unlikely to have more than one extent per 512
bytes of length, so even if we have a header for every single sector
(worst-case for alternating clean/dirty sectors), as long as the
smallest granularity of an extent is larger than the extent field, the
'length of parameter data' in bytes is still sufficient.

> +  - zero or more LBA status descriptors, each having the following

zero or more? [1]

> +structure:
> +
> +* 64 bits, offset (unsigned)
> +* 32 bits, length (unsigned)
> +* 16 bits, status (unsigned)

An array of these status descriptors is packed on the wire, while the
typical C layout of an array of these structures will have padding to
reach a nice 8-byte alignment.  Should 'status' be a 32-bit field, so
that clients and servers do not have to pack/unpack between 14 bytes on
the wire and 16 bytes in efficient array handling, at the expense of
more network traffic?

Conversely, it would be possible to send less data over the wire, as
long as we require that all LBA status descriptors cover consecutive
offsets.  That is, having the server reply with offsets is pointless,
since they can be reconstructed on the client by starting with the
offset in the client's request, then adding length from each status
field.  Is less network traffic desirable?

> +
> +unless an error condition has occurred.
> +
> +If an error occurs, the server SHOULD set the appropriate error code
> +in the error field. The server MUST then either close the
> +connection, or send *length of parameter data* bytes of data
> +(which MAY be invalid).
> +
> +The type of information required by the client is passed to server in the
> +command flags field. If the server does not implement requested type or
> +have no means to express it, it MUST NOT return an error, but instead 
> MUST
> +return a single LBA status descriptor with *offset* and *length* equal to
> +the *offset* and *length* from request, and *status* set to `0`.

[1] So in what situations will we ever return an array of zero status
fields? On an error?  Should we make it