Re: [Nbd] [Qemu-devel] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-04-13 Thread Pavel Borzenkov
Hi Eric,

On Thu, Apr 07, 2016 at 10:10:58AM -0600, Eric Blake wrote:
> On 04/07/2016 04:38 AM, Vladimir Sementsov-Ogievskiy wrote:
> > On 05.04.2016 16:43, Paolo Bonzini wrote:
> >>
> >> On 05/04/2016 06:05, Kevin Wolf wrote:
> >>> The options I can think of is adding a request field "max number of
> >>> descriptors" or a flag "only single descriptor" (with the assumption
> >>> that clients always want one or unlimited), but maybe you have a better
> >>> idea.
> >> I think a limit is better.  Even if the client is ultimately going to
> >> process the whole file, it may take a very long time and space to
> >> retrieve all the descriptors in one go.  Rather than query e.g. 16GB at
> >> a time, I think it's simpler to put a limit of 1024 descriptors or so.
> >>
> >> Paolo
> >>
> > 
> > I vote for the limit too. More over, I think, there should be two sides
> > limit:
> > 
> > 1. The client can specify the limit, so server should not return more
> > extents than requested. Of course, server should chose sequential
> > extents from the beginning of requested range.
> 
> For the client to request a limit would entail that we enhance the
> protocol to allow structured requests (where a wire-sniffer would know
> how many bytes to read for the client's additional data, even if it does
> not understand the extension's semantics).  Might not be a bad idea to
> have this in the long run, but so far I've been reluctant to bite the
> bullet.
> 
> > 2. Server side limit: if client asked too many extents or not specified
> > a limit at all, server should not return all extents, but only 1024 (for
> > ex.) from the beginning of the range.
> 
> Okay, I'm fairly convinced now that letting the server limit the reply
> is a good thing, and that one doesn't require a structured request from
> the client.  Since we just recently documented that strings should be no
> more than 4096 bytes, and my v2 proposal used 8 bytes per descriptor,
> maybe a good way to enforce a similar limit would be:
> 
> The server MAY choose to send fewer descriptors than what would describe
> the full extent of the client's request, but MUST send at least one
> descriptor unless an error is reported.  The server MUST NOT send more
> than 512 descriptors, even if that does not completely describe the
> client's requested length.
> 
> That way, a client in general should never expect more than ~4096 bytes
> + overhead on any server reply except a reply to NBD_CMD_READ, and can
> therefore utilize stack allocation for all other replies (if we do this,
> maybe we should make a hard rule that all future protocol extensions,
> other than NBD_CMD_READ, will guarantee that a reply has a bounded size)
> 
> I also think it may be okay to let the server reply with MORE data than
> the client requested, but only as long as it does not result in any
> extra descriptors (that is, only the last descriptor can result in a
> length beyond the client's request).  For example, if the client asks
> for block status of 1M of the file, but the server can conveniently
> learn via lseek(SEEK_HOLE) or other means that there are 2M of data
> before status changes, then there's no reason to force the server to
> throw away the information about the 1M beyond the client's read, and
> the client might even be able to be more efficient in later requests.
> 
> > 2.1 And/or, why not allow the server use the power of structured reply
> > and send several reply chunks? Why did you forbid this? (if I correctly
> > understand "This chunk type MUST appear at most once in a structured
> > reply.")
> 
> If we allow more than one chunk, then either every chunk has to include
> an offset (more traffic over the wire), or the chunks have to be sent in
> a particular order (we aren't gaining any benefits that NBD_CMD_READ
> gains by allowing out-of-order transmission).  It's also more work for
> the client to reconstruct if it has to reassemble; with NBD_CMD_READ,
> the payload is dominated by the data being read, and you can pwrite()
> the data into its final location as the client; but with
> NBD_CMD_BLOCK_STATUS, the payload is dominated by the metadata and we
> want to keep it minimal; and there is no convenient command for the
> client to reassemble the information if received out of order.
> 
> Allowing for a short reply seems to be worth doing, but allowing for
> multiple reply chunks seems not worth the risk.
> 
> I'm also starting to think that it is worth FIRST documenting an
> extension for advertising block sizes, so that we can then couch
> BLOCK_STATUS in those terms (a server MUST NOT subdivide status into
> finer granularity than the advertised block sizes).

Why do you need to operate with blocks instead of list of extents?
What benefits will this approach provide for a client or a server?

Are you still working on the spec? I can update the patch with
information about server-side limit/beyond request's length replies and
post v3, so that things keep moving forward.

-- 

Re: [Nbd] [PATCH v2] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-04-07 Thread Pavel Borzenkov
On Tue, Apr 05, 2016 at 03:43:08PM +0200, Paolo Bonzini wrote:
> 
> 
> On 05/04/2016 06:05, Kevin Wolf wrote:
> > The options I can think of is adding a request field "max number of
> > descriptors" or a flag "only single descriptor" (with the assumption
> > that clients always want one or unlimited), but maybe you have a better
> > idea.
> 
> I think a limit is better.  Even if the client is ultimately going to
> process the whole file, it may take a very long time and space to
> retrieve all the descriptors in one go.  Rather than query e.g. 16GB at
> a time, I think it's simpler to put a limit of 1024 descriptors or so.

Agree. I'm not sure that the value of the limit should be hard-coded
into the protocol, though. Why don't just allow a server to send less
data than requested (similar to what SCSI "GET LBA STATUS" allows) and
allow the server to choose the limit suitable for it (without directly
exposing it to the clients)?

> Paolo

--
___
Nbd-general mailing list
Nbd-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nbd-general


Re: [Nbd] [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension

2016-03-24 Thread Pavel Borzenkov
On Thu, Mar 24, 2016 at 09:04:07AM -0600, Eric Blake wrote:
> On 03/24/2016 06:30 AM, Pavel Borzenkov wrote:
> >> Conversely, it would be possible to send less data over the wire, as
> >> long as we require that all LBA status descriptors cover consecutive
> >> offsets.  That is, having the server reply with offsets is pointless,
> >> since they can be reconstructed on the client by starting with the
> >> offset in the client's request, then adding length from each status
> >> field.  Is less network traffic desirable?
> > 
> > I think it's better to explicitly send the start of each LBA extent.
> 
> Why? Is the redundancy for something that the client can reconstruct
> worth the extra safety at the cost of more traffic?

On second thought, not sending offsets look better. Firstly, they are
really not required for client to process the reply. Secondly, it also
implies that the regions are distinct and in sorted order and makes it
impossible to do otherwise.

-- 
Pavel

> 
> > So I'll go with changing 'status' field to 32 bits to avoid
> > packing/unpacking issues.
> 
> You may want to do that even if you eliminate the offset field, so that
> you have 8 bytes per struct (instead of 16).
> 
> >>
> >> Do we want to require that the server MUST reply with enough extents to
> >> sum up to the length of the client's request, or are we permitting a
> >> short reply?
> > 
> > While the "GET LBA STATUS" command in SCSI allows partial reply, I
> > believe it'd better to keep things simple and require that the server
> > must either return a list of extents that covers the whole requested
> > range, or an error.
> 
> Make sure you specify whether ranges are allowed to overlap, or must be
> distinct (I prefer the latter), and whether they must appear in sorted
> order (which I also prefer) - once you mandate ordered and
> non-overlapping coverage, client-side validation that the server's
> answer makes sense is easier (remember, we want the client to detect
> man-in-the-middle corruption of the server's reply).
> 
> 
> > Actually, for this command I treat 'command flags' field not as a set of
> > flags, but rather as a plain number representing required mode of
> > operation. Probably, not a good idea as it doesn't match the rest of the
> > commands.
> > 
> > I went this way because I didn't want to allow clients to request
> > several modes simultaneously (e.g. provisioning + dirtiness) in the same
> > request. This makes server side implementation harder.
> > 
> > I think I'll just switch to bits to match the rest of the commands and
> > will add a note, that server should return EINVAL in case several modes
> > are requested simultaneously.
> 
> But you don't need two bits.  Just a single bit will do (off for
> provisioning mode, on for dirty mode).  So there are no conflicting
> modes that can be simultaneously requested, at least in the current
> definition of a single valid flag bit.  (Then again, I did make a
> suggestion about additional bits, useful only during provisioning mode,
> that might be used to state that the client is okay if the server
> coalesces extents that differ only in allocation or only in zeroed
> content - if we add that bit or two bits, it would be an error to use it
> while requesting dirty mode).
> 
> 
> >> then we can express four states:
> >>
> >> 0x0 - LBA extent not present, client MUST NOT make assumptions about
> >> contents, and reads should not be attempted
> >> 0x1 - LBA extent allocated, reads will succeed but no guarantee on contents
> >> 0x2 - LBA extent not present, but client can treat the extent as zeroes
> >> and reads will succeed
> >> 0x3 - LBA extent present, client can treat the extent as zeroes and
> >> reads will succeed
> > 
> > I'm not sure that clients need this level of details. From client's POV
> > 0x2 and 0x3 are the same.
> 
> No, if the client is trying to EXACTLY copy sparseness, then 0x2 and 0x3
> differ on whether the client will punch a hole vs. explicitly allocate
> zeroes.
> 
> -- 
> Eric Blake   eblake redhat com+1-919-301-3266
> Libvirt virtualization library http://libvirt.org
> 



--
Transform Data into Opportunity.
Accelerate data analysis in your applications with
Intel Data Analytics Acceleration Library.
Click to learn more.
http://pubads.g.doubleclick.net/gampad/clk?id=278785351=/4140
___
Nbd-general mailing list
Nbd-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nbd-general


Re: [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension

2016-03-24 Thread Pavel Borzenkov
On Thu, Mar 24, 2016 at 09:41:29AM +0100, Wouter Verhelst wrote:
> On Thu, Mar 24, 2016 at 11:25:52AM +0300, Pavel Borzenkov wrote:
> > On Wed, Mar 23, 2016 at 07:14:54PM +0100, Kevin Wolf wrote:
> > > Am 23.03.2016 um 18:58 hat Wouter Verhelst geschrieben:
> > > > On Wed, Mar 23, 2016 at 05:16:02PM +0300, Denis V. Lunev wrote:
> > > > > +the provisioning state of the device. The following provisionnig 
> > > > > states
> > > > > +are defined for the command:
> > > > > +
> > > > > +  - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the 
> > > > > block device;
> > > > > +  - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block 
> > > > > device
> > > > > +and contains zeroes;
> > > > 
> > > > Presumably this should be "contains only zeroes"?
> > > > 
> > > > Also, this may end up being a fairly expensive call for the server to
> > > > process. Is it really useful?
> > > 
> > > I think we need to make clear that this is meant as an optimisation and
> > > it's always a valid option for a server to return NBD_STATE_ALLOCATED
> > > even if the contents is zeroed.
> > > 
> > > It is definitely useful if the server has a means to efficiently find
> > > out the allocation status (e.g. SEEK_HOLE). In that case the client may
> > > be able to avoid reading the block and sending it over the network, or
> > > when making a copy, it could use it to keep the target file sparse. If
> > > the client can't take advantage, we didn't have much overhead, so it's
> > > fine.
> > 
> > Yes, that was the idea. I'll add a note that the server may return
> > NBD_STATE_ALLOCATED instead of NBD_STATE_ZEROED if it has not means to
> > efficiently differentiate allocated blocks with zeroes from allocated
> > blocks with non-zeroed content.
> 
> Okay, that alleviates my concerns.
> 
> In that case it might be useful if the server could say something along
> the lines of "I know it's allocated, but I didn't check whether there's
> anything non-zero in there"? The client can then decide to do nothing
> with that information; but the more useful information is sent along,
> the better...

Doesn't allocated state mean exactly this? E.g. it is allocated and I
have no idea what the content is.

> 
> -- 
> < ron> I mean, the main *practical* problem with C++, is there's like a dozen
>people in the world who think they really understand all of its rules,
>and pretty much all of them are just lying to themselves too.
>  -- #debian-devel, OFTC, 2016-02-12

--
Transform Data into Opportunity.
Accelerate data analysis in your applications with
Intel Data Analytics Acceleration Library.
Click to learn more.
http://pubads.g.doubleclick.net/gampad/clk?id=278785351=/4140
___
Nbd-general mailing list
Nbd-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nbd-general


Re: [Nbd] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension

2016-03-24 Thread Pavel Borzenkov
On Wed, Mar 23, 2016 at 06:58:34PM +0100, Wouter Verhelst wrote:
> On Wed, Mar 23, 2016 at 05:16:02PM +0300, Denis V. Lunev wrote:
> > From: Pavel Borzenkov <pborzen...@virtuozzo.com>
> > 
> > With the availability of sparse storage formats, it is often needed to
> > query status of a particular LBA range and read only those blocks of
> > data that are actually present on the block device.
> > 
> > To provide such information, the patch adds GET_LBA_STATUS extension
> > with one new NBD_CMD_GET_LBA_STATUS command.
> > 
> > There exists a concept of data dirtiness, which is required during, for
> > example, incremental block device backup. To express this concept via
> > NBD protocol, this patch also adds additional mode of operation to
> > NBD_CMD_GET_LBA_STATUS command.
> > 
> > Since NBD protocol has no notion of block size, and to mimic SCSI "GET
> > LBA STATUS" command more closely, it has been chosen to return a list of
> > extents in the response of NBD_CMD_GET_LBA_STATUS command, instead of a
> > bitmap.
> > 
> > Signed-off-by: Pavel Borzenkov <pborzen...@virtuozzo.com>
> > Reviewed-by: Roman Kagan <rka...@virtuozzo.com>
> > Signed-off-by: Denis V. Lunev <d...@openvz.org>
> > CC: Wouter Verhelst <w...@uter.be>
> > CC: Paolo Bonzini <pbonz...@redhat.com>
> > CC: Kevin Wolf <kw...@redhat.com>
> > CC: Stefan Hajnoczi <stefa...@redhat.com>
> > ---
> >  doc/proto.md | 82 
> > 
> >  1 file changed, 82 insertions(+)
> > 
> > diff --git a/doc/proto.md b/doc/proto.md
> > index cda213c..fff515d 100644
> > --- a/doc/proto.md
> > +++ b/doc/proto.md
> > @@ -243,6 +243,8 @@ immediately after the global flags field in oldstyle 
> > negotiation:
> >`NBD_CMD_TRIM` commands
> >  - bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the server
> >supports `NBD_CMD_WRITE_ZEROES` commands
> > +- bit 7, `NBD_FLAG_SEND_GET_LBA_STATUS`; should be set to 1 if the server
> > +  supports `NBD_CMD_GET_LBA_STATUS` commands
> >  
> >  # Client flags
> >  
> > @@ -477,6 +479,10 @@ The following request types exist:
> >  
> >  Defined by the experimental `WRITE_ZEROES` extension; see below.
> >  
> > +* `NBD_CMD_GET_LBA_STATUS` (7)
> > +
> > +Defined by the experimental `GET_LBA_STATUS` extension; see below.
> > +
> >  * Other requests
> >  
> >  Some third-party implementations may require additional protocol
> > @@ -638,6 +644,82 @@ The server SHOULD return `ENOSPC` if it receives a 
> > write zeroes request
> >  including one or more sectors beyond the size of the device. It SHOULD
> >  return `EPERM` if it receives a write zeroes request on a read-only export.
> >  
> > +### `GET_LBA_STATUS` extension
> > +
> > +With the availability of sparse storage formats, it is often needed to 
> > query
> > +status of a particular LBA range and read only those blocks of data that 
> > are
> > +actually present on the block device.
> > +
> > +Some storage formats and operations over such formats express a concept of
> > +data dirtiness. Whether the operation is block device mirroring,
> > +incremental block device backup or any other operation with a concept of
> > +data dirtiness, they all share a need to provide a list of LBA ranges
> > +that this particular operation treats as dirty.
> > +
> > +To provide such class of information, `GET_LBA_STATUS` extension adds new
> > +`NBD_CMD_GET_LBA_STATUS` command which returns a list of LBA ranges with
> > +their respective states.
> > +
> > +* `NBD_CMD_GET_LBA_STATUS` (7)
> > +
> > +An LBA range status query request. Length and offset define the range
> > +of interest. The server MUST reply with a reply header, followed
> > +immediately by the following data:
> 
> As Eric noted, please expand LBA at least once.
> 
> > +  - 32 bits, length of parameter data that follow (unsigned)
> > +  - zero or more LBA status descriptors, each having the following
> > +structure:
> > +
> > +* 64 bits, offset (unsigned)
> > +* 32 bits, length (unsigned)
> > +* 16 bits, status (unsigned)
> > +
> > +unless an error condition has occurred.
> > +
> > +If an error occurs, the server SHOULD set the appropriate error code
> > +in the error field. The server MUST then either close the
> > +connection, or send *l

Re: [Nbd] [Qemu-devel] [PATCH 1/2] NBD proto: add WRITE_ZEROES extension

2016-03-24 Thread Pavel Borzenkov
On Wed, Mar 23, 2016 at 09:14:10AM -0600, Eric Blake wrote:
> On 03/23/2016 08:16 AM, Denis V. Lunev wrote:
> > From: Pavel Borzenkov <pborzen...@virtuozzo.com>
> > 
> > There exist some cases when a client knows that the data it is going to
> > write is all zeroes. Such cases include mirroring or backing up a device
> > implemented by a sparse file.
> > 
> > With current NBD command set, the client has to issue NBD_CMD_WRITE
> > command with zeroed payload and transfer these zero bytes through the
> > wire. The server has to write the data onto disk, effectively denying
> > the sparseness.
> > 
> > To remedy this, the patch adds WRITE_ZEROES extension with one new
> > NBD_CMD_WRITE_ZEROES command.
> > 
> > Signed-off-by: Pavel Borzenkov <pborzen...@virtuozzo.com>
> > Reviewed-by: Roman Kagan <rka...@virtuozzo.com>
> > Signed-off-by: Denis V. Lunev <d...@openvz.org>
> > CC: Wouter Verhelst <w...@uter.be>
> > CC: Paolo Bonzini <pbonz...@redhat.com>
> > CC: Kevin Wolf <kw...@redhat.com>
> > CC: Stefan Hajnoczi <stefa...@redhat.com>
> > ---
> >  doc/proto.md | 44 
> >  1 file changed, 44 insertions(+)
> > 
> 
> >  
> > +* `NBD_CMD_WRITE_ZEROES` (6)
> > +
> > +Defined by the experimental `WRITE_ZEROES` extension; see below.
> 
> If this patch goes in to the NBD sources, the extension is not
> experimental any more, right?  Shouldn't the patch be written under the
> assumption that it will be the final form of the text, even though the
> feature is experimental until then? [1]
> 
> > +### `WRITE_ZEROES` extension
> > +
> > +There exist some cases when a client knows that the data it is going to 
> > write
> > +is all zeroes. Such cases include mirroring or backing up a device 
> > implemented
> > +by a sparse file. With current NBD command set, the client has to issue
> > +`NBD_CMD_WRITE` command with zeroed payload and transfer these zero bytes
> > +through the wire. The server has to write the data onto disk, effectively
> > +denying the sparseness.
> 
> s/denying/losing/ ?
> 
> > +
> > +To remedy this, a `WRITE_ZEROES` extension is envisioned. This extension 
> > adds
> 
> s/is envisioned/exists/ - again, under the argument that once it is part
> of this document, it is no longer experimental.
> 
> /me goes and reads the actual proto.md file, and light bulb turns on...
> 
> [1] Oh, you ARE adding this to the "Experimental extensions" section of
> the document, so your wording IS correct.  I guess the idea is that we
> write up the documentation in the experimental section, tweak qemu to
> implement it both as NBD client and as NBD server (since qemu has code
> that can serve in both positions), see how well it worked, and THEN do a
> followup patch to proto.md to move the text into the non-experimental
> section, along with any tweaks learned during the qemu patches.

Yes, that is the plan.

> > +one new command with two command flags.
> > +
> > +* `NBD_CMD_WRITE_ZEROES` (6)
> > +
> > +A write request with no payload. Length and offset define the location
> > +and amount of data to be zeroed.
> > +
> > +The server MUST zero out the data on disk, and then send the reply
> > +message. The server MAY send the reply message before the data has
> > +reached permanent storage.
> > +
> > +If the `NBD_FLAG_SEND_FUA` flag ("Force Unit Access") was set in the
> > +export flags field, the client MAY set the flag `NBD_CMD_FLAG_FUA` 
> > (bit 0)
> > +in the command flags field. If this flag was set, the server MUST NOT 
> > send
> > +the reply until it has ensured that the newly-zeroed data has reached
> > +permanent storage.
> 
> Do we want to add:
> 
> The server SHOULD return EINVAL if the client set NBD_CMD_FLAG_FUA but
> the export flags did not include NBD_FLAG_SEND_FUA.

Looks like a good idea. Needs to be added here and in NBD_CMD_WRITE
command description.

> 
> > +
> > +If the flag `NBD_CMD_FLAG_MAY_TRIM` (bit 1) was set by the client in 
> > the
> > +command flags field, the server MAY use trimming to zero out the area,
> > +but it MUST ensure that the data reads back as zero.
> 
> Bug in the existing spec: The constant NBD_CMD_FLAG_FUA is mentioned,
> but never defined with a fixed value.  Your text above defines it to
> 'bit 0' in the command flags field - is that correct?  If so, should we
> add a section to the document that lists the bit values of a