Re: [Qemu-devel] drive-mirroring to nbd is failing with multiple parallel jobs (qemu 2.9 -> 2.11)

2018-02-16 Thread Alex Bligh

> On 15 Feb 2018, at 04:42, Wouter Verhelst <w...@uter.be> wrote:
> 
> 
> We already set the SO_KEEPALIVE socket option (at least nbd-server does;
> don't know about qemu) to make the kernel send out TCP-level keepalive
> probes. This happens only after two hours (by default), but it's
> something you can configure on your system if you need it to be lower.

+1 for just using SO_KEEPALIVE. I think I even submitted some (untested
and thus unmerged) patches for this.
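
(For anyone wanting to experiment, a minimal sketch of what enabling this looks like - untested, with arbitrary example values; the TCP_KEEP* per-socket overrides are Linux-specific, and this is not a claim about what nbd-server or qemu currently do.)

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Sketch: enable TCP keepalive on an established NBD socket. The
 * system-wide default idle time is two hours; on Linux, TCP_KEEPIDLE
 * and friends override it per socket. Values below are arbitrary. */
static int enable_keepalive(int fd)
{
    int on = 1;
    int idle = 120;   /* seconds of idleness before probing starts */
    int intvl = 30;   /* seconds between probes */
    int cnt = 4;      /* unanswered probes before the connection drops */

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) < 0)
        return -1;
    return setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt));
}
```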

-- 
Alex Bligh







Re: [Qemu-devel] [PATCH 1/1] nbd: increase maximum size of the PWRITE_ZERO request

2018-02-10 Thread Alex Bligh

> On 10 Feb 2018, at 18:43, Alex Bligh <a...@alex.org.uk> wrote:
> 
> So I think a reasonable logic for Qemu would be to try NBD_CMD_INFO and find 
> the maximum write size, and if that's unsupported use 0xffffffff (capping at 
> export size, or export size minus write offset).

Er, actually capping it at (2^32 - blocksize) would be the right thing to do 
(writes should be multiples of the block size).

-- 
Alex Bligh







Re: [Qemu-devel] [PATCH 1/1] nbd: increase maximum size of the PWRITE_ZERO request

2018-02-10 Thread Alex Bligh

> On 8 Feb 2018, at 16:28, Eric Blake <ebl...@redhat.com> wrote:
> 
> On 02/08/2018 07:23 AM, Edgar Kaziakhmedov wrote:
>> Upstream NBD protocol implementation supports an efficient zero out
>> mechanism over the wire, along with the ability to check whether a
>> client allows using a hole.
>> Accordingly, since PWRITE_ZERO doesn't involve any payload on the wire,
>> increase a maximum size of the PWRITE_ZERO request up to 1Gb (aligned).
>> Moreover, such change will decrease the number of PWRITE_ZERO NBD commands
>> in comparison with the current 32M limit. The benefits of
>> the larger constraint can be examined in a block mirroring over NBD.
> 
> We've got a potential problem.  Unless you have out-of-band communication of 
> the maximum NBD_CMD_WRITE_ZEROES sizing (or if the NBD protocol is enhanced 
> to advertise that as an additional piece of block size information during 
> NBD_OPT_GO), then a client CANNOT assume that the server will accept a 
> request this large.  We MIGHT get lucky if all existing servers that accept 
> WRITE_ZEROES requests either act on large requests or reply with EINVAL but 
> do not outright drop the connection (which is different from servers that DO 
> outright drop the connection for an NBD_CMD_WRITE larger than 32M).  But I 
> don't know if that's how all servers behave, so sending a too-large 
> WRITE_ZEROES request may have the unintended consequence of killing the 
> connection.
> 
> I'm adding the NBD list; perhaps before accepting this into qemu, I should 
> revive my earlier attempt at codifying an NBD_OPT_GO info advertisement for 
> maximum trim/zero sizing, which would let clients have a guarantee that their 
> choice of sizing won't cause unexpected failures.

A couple of comments:

1. The length field is only 32 bits, so no writes more than 0xffffffff in 
length are going to work anyway :-)

2. I'm not sure the situation is as bad as you make out Eric. I think you've 
forgotten the NBD_OPT_INFO work and the conversation around that we had where 
we determined that servers not supporting NBD_OPT_INFO were already meant to 
support 'unlimited' size writes. Per the spec:

"If block size constraints have not been advertised or agreed on externally, 
then a client SHOULD assume a default minimum block size of 1, a preferred 
block size of 2^12 (4,096), and a maximum block size of the smaller of the 
export size or 0xffffffff (effectively unlimited)."

I read these to apply to all uses of 'length', but even if one argues it 
doesn't apply to NBD_CMD_WRITE_ZEROES because it doesn't have a payload, I 
think the rebuttal is that a server which supports NBD_CMD_WRITE of a given 
length must also support NBD_CMD_WRITE_ZEROES of that length.

So I think a reasonable logic for Qemu would be to try NBD_CMD_INFO and find 
the maximum write size, and if that's unsupported use 0xffffffff (capping at 
export size, or export size minus write offset).
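
Something like this sketch, say - untested, and `info_max_write` merely stands in for whatever maximum was learned via NBD_OPT_INFO/NBD_OPT_GO (zero if the server advertised nothing):

```c
#include <stdint.h>

/* Sketch: pick the largest NBD_CMD_WRITE_ZEROES length a client may
 * safely use. If no maximum was advertised, the spec default of
 * 0xffffffff applies, rounded down to a multiple of the block size
 * (i.e. 2^32 - blocksize). */
static uint32_t max_write_zeroes_len(uint64_t export_size, uint64_t offset,
                                     uint32_t blocksize,
                                     uint32_t info_max_write)
{
    uint64_t cap = info_max_write ? info_max_write
                                  : 0x100000000ULL - blocksize;
    uint64_t remaining = export_size - offset;

    if (cap > remaining)
        cap = remaining;
    cap -= cap % blocksize;   /* writes should be blocksize multiples */
    return (uint32_t)cap;
}
```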

-- 
Alex Bligh







Re: [Qemu-devel] [RFH PATCH 00/10] i386: hvf: miscellaneous cleanups

2017-10-03 Thread Alex Bligh

> On 3 Oct 2017, at 15:45, Paolo Bonzini <pbonz...@redhat.com> wrote:
> 
> Alex (both of them), Sergio, anyone else who can help?

Very interested in this (and thanks!) but it will be a while
before I have a sensible number of cycles available to play
with this one again.

-- 
Alex Bligh







Re: [Qemu-devel] [Nbd] [Qemu-block] How to online resize qemu disk with nbd protocol?

2017-01-23 Thread Alex Bligh

> On 23 Jan 2017, at 14:54, Eric Blake <ebl...@redhat.com> wrote:
> 
> Thoughts?

My main thought is that the purpose of the extension branches is
to discuss and come to a consensus over experimental extension protocols.
Whilst I think creating a branch should be a lightweight affair
(fine), we explicitly say people should know the implications
of using extensions in shipping code - specifically that the
specification may change!

--
Alex Bligh








Re: [Qemu-devel] [PATCH] Further tidy-up on block status

2017-01-20 Thread Alex Bligh

> On 20 Jan 2017, at 17:04, Vladimir Sementsov-Ogievskiy 
> <vsement...@virtuozzo.com> wrote:
> 
> About extents: is 32bit length enough? We will have to send 4096 for empty 
> 16tb disk..

The nbd protocol uniformly uses 32 bit lengths (for better or for worse). This 
is baked into the specification heavily.

I'm not sure sending 4,096 items for an empty 16TB disk is any great hardship 
to be honest.

-- 
Alex Bligh







Re: [Qemu-devel] [Nbd] [Qemu-block] How to online resize qemu disk with nbd protocol?

2017-01-14 Thread Alex Bligh

> On 14 Jan 2017, at 14:48, Wouter Verhelst <w...@uter.be> wrote:
> 
> On Thu, Jan 12, 2017 at 06:56:42PM +0000, Alex Bligh wrote:
>> My preferred way to do this would be essentially to allow NBD_OPT_INFO
>> to be sent (wrapped appropriately) during the main transmission phase.
>> That would allow *any* INFO stuff to be reread.
> 
> Can you go into a bit more detail how you'd see that? It feels a bit
> awkward to me, at best; but I could be missing something.

Well, the idea would be something like: NBD_CMD_INFO, use the offset
specified as the information type, and put the INFO reply
within the reply block. I guess the client doesn't know the
length of the reply so we'd have to use structured replies
or similar. Although looking through the current info types,
I can't see why today you'd want to reread anything other than
the length, so maybe this is useless futureproofing.
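
To make that slightly more concrete, the hypothetical command could simply reuse the existing transmission header - the layout below is the standard NBD request; everything about repurposing *offset* is speculation on my part, not spec:

```c
#include <stdint.h>

/* Hypothetical sketch only: the standard NBD transmission request
 * header, with *offset* repurposed to carry the information type to
 * be re-read. Neither an in-transmission NBD_CMD_INFO nor this
 * encoding exists in the spec. */
struct nbd_request {
    uint32_t magic;    /* NBD_REQUEST_MAGIC */
    uint16_t flags;
    uint16_t type;     /* would be a newly assigned NBD_CMD_INFO */
    uint64_t handle;
    uint64_t offset;   /* repurposed: info type, e.g. export size */
    uint32_t length;   /* zero; reply length would come from a
                          structured reply chunk */
};
```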

-- 
Alex Bligh







Re: [Qemu-devel] [PATCH] Further tidy-up on block status

2017-01-13 Thread Alex Bligh

> On 13 Jan 2017, at 09:48, Vladimir Sementsov-Ogievskiy 
> <vsement...@virtuozzo.com> wrote:
> 
> 12.01.2017 16:11, Alex Bligh wrote:
>>> On 12 Jan 2017, at 07:05, Vladimir Sementsov-Ogievskiy 
>>> <vsement...@virtuozzo.com> wrote:
>>> 
>>> Yes this is better. But is it actually needed to force contexts have some 
>>> safe default? If context wants it may define such default without this 
>>> requirement.. So, should it be requirement at all?
>> I've changed this to:
>> 
>> of the file), a server MAY reply with a single block status
>> descriptor with *length* matching the requested length, rather than
>> reporting the error; in this case the context MAY mandate the
>> status returned.
>> 
>> 
> 
> Hmm, I don't understand. So, it MAY mandate and therefore MAY NOT do it? And 
> what client should think, if server replies with one chunk matching the 
> request length and not mandate the status?

Some contexts may mandate a particular value (so for instance the allocation 
context might mandate 0).

Some contexts may not mandate a particular value, in which case the 
interpretation is dependent upon the context (just like any other status 
value). E.g. a context which returned a status of 7 if the range contained a 
prime number, and otherwise 3, could carry on doing that.

As it doesn't make sense to interpret status returns without an understanding 
of the particular context, we might as well simply extend this to 'beyond the 
range' returns - as I think you pointed out!

-- 
Alex Bligh







Re: [Qemu-devel] [Qemu-block] [Nbd] How to online resize qemu disk with nbd protocol?

2017-01-12 Thread Alex Bligh

> On 12 Jan 2017, at 18:45, Eric Blake <ebl...@redhat.com> wrote:
> 
> On 01/12/2017 11:57 AM, Bob Chen wrote:
>> There might be a time window between the NBD server's resize and the
>> client's `re-read size` request. Is it safe?
> 
> For resize larger, it seems that it would be safe for the server to just
> remember the last size it has advertised to the client.  As long as the
> client doesn't request read/write to any offset beyond the
> last-advertised size (either the size the server gave at handshake, or
> the size that the server reported when the client last used the new
> 're-read size' command), there's no difference in behavior (and
> well-behaved clients fall in this group); if the client DOES try to
> access out-of-bounds with respect to the last known size, the server
> SHOULD reject the access, but MAY serve it instead.  A client that is
> unaware of the 're-read size' command will never learn that the server
> is now offering a larger image.
> 
> For resize smaller, things are a lot trickier - how do you block access
> to a portion of a file that the client still thinks exist, but which the
> server has truncated away?  Maybe the easy way out is to state that the
> server MUST NOT resize smaller.

I'm not sure why this is any harder than Qemu writing to a physical partition
that changes size. You need to change the size of things in order. To
resize smaller, you tell Qemu it's smaller, then resize the partition. To
resize larger, you resize the partition then tell Qemu it's larger.

This is incidentally exactly the same thing you do if you have a filing
system on a partition (or a filing system on a real nbd device - we could
support the same reread of geometry in kernel).
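
To spell out the ordering (hypothetical helper names, purely to show the rule):

```c
#include <stdint.h>

/* Hypothetical helpers, sketching only the ordering: whichever side
 * still believes the old size must always see the smaller of the
 * two views. */
void tell_client_new_size(uint64_t size);   /* e.g. via 'reread size' */
void resize_backing_store(uint64_t size);   /* e.g. resize the partition */

static void resize_smaller(uint64_t new_size)
{
    tell_client_new_size(new_size);   /* client stops using the tail... */
    resize_backing_store(new_size);   /* ...then the storage shrinks */
}

static void resize_larger(uint64_t new_size)
{
    resize_backing_store(new_size);   /* grow the storage first... */
    tell_client_new_size(new_size);   /* ...then let the client use it */
}
```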

>> 
>> What about an active `resize` request from the client? Considering some NBD
>> servers might have the capability to do instant resizing, not applying to
>> LVM or host block device, of course...
> 
> And perhaps the 're-read size' command can serve a dual purpose.  Since
> all NBD commands already include space for size and offset parameters,
> we could define that if the client passes 0/0 for size and offset, it
> wants to read the server's current size; if it passes 0/non-zero, it is
> asking the server to resize to the new non-zero size (and the server's
> success or error response tells whether the resize happened).
> 
> Should I spend time turning this idea into a more formal spec, along the
> lines of other NBD extension proposals?

I'd support the reread bit. I'm less sure we ever want to push a new
size from the client. It seems a bit like a layering violation. Perhaps
I can be convinced.

My preferred way to do this would be essentially to allow NBD_OPT_INFO
to be sent (wrapped appropriately) during the main transmission phase.
That would allow *any* INFO stuff to be reread.

If it's just this (rather than a resize command) I guess it could
go into the NBD_OPT_INFO extension branch. If it's going to do
more than _INFO_, then I guess it needs its own extension.

--
Alex Bligh








Re: [Qemu-devel] [Nbd] How to online resize qemu disk with nbd protocol?

2017-01-12 Thread Alex Bligh

> On 12 Jan 2017, at 14:43, Eric Blake <ebl...@redhat.com> wrote:
> 
> That's because the NBD protocol lacks a resize command.  You'd have to
> first get that proposed as an NBD extension before qemu could support it.

Actually the NBD protocol lacks a 'make a disk with size X' command,
let alone a resize command. The size of an NBD disk is (currently)
entirely in the hands of the server. What I think we'd really need
would be a 'reread size' command, and have the server capable of
supporting resizing. That would then work for readonly images too.

--
Alex Bligh








Re: [Qemu-devel] [PATCH] Further tidy-up on block status

2017-01-12 Thread Alex Bligh

> On 12 Jan 2017, at 11:43, Vladimir Sementsov-Ogievskiy 
> <vsement...@virtuozzo.com> wrote:
> 
> From "NBD_OPT_LIST_META_CONTEXT (9)":
>> The server MUST either reply with an error (for instance EINVAL if the 
>> option is not supported), or reply with a list of NBD_REP_META_CONTEXT 
>> replies followed by NBD_REP_ACK.:
> NBD_REP_ERR_UNSUP for the case in braces? Should it be normal option reply 
> packet?

Thanks, fixed.

-- 
Alex Bligh







Re: [Qemu-devel] [PATCH] Further tidy-up on block status

2017-01-12 Thread Alex Bligh

> On 12 Jan 2017, at 07:05, Vladimir Sementsov-Ogievskiy 
> <vsement...@virtuozzo.com> wrote:
> 
> Yes this is better. But is it actually needed to force contexts have some 
> safe default? If context wants it may define such default without this 
> requirement.. So, should it be requirement at all?

I've changed this to:

of the file), a server MAY reply with a single block status
descriptor with *length* matching the requested length, rather than
reporting the error; in this case the context MAY mandate the
status returned.


-- 
Alex Bligh







Re: [Qemu-devel] [PATCH] Further tidy-up on block status

2017-01-11 Thread Alex Bligh

> On 11 Jan 2017, at 15:31, Vladimir Sementsov-Ogievskiy 
> <vsement...@virtuozzo.com> wrote:
> 
> >>> If an error occurs, the server SHOULD set the appropriate error code in 
> >>> the error field of an error chunk. However, if the error does not involve 
> >>> invalid usage (such as a request beyond the bounds of the file), a server 
> >>> MAY reply with a single block status descriptor with length matching the 
> >>> requested length, and status of 0 rather than reporting the error.
> - single block status descriptor for each context? Isn't it implementation 
> defined? Or we finally decided to force 0 status to be safe default for all 
> contexts? If it is so, it would be better to describe this separately. 
> However, personally, I'd prefer to not define contexts internal semantics at 
> all.

I think this is Wouter's wording, but I think 'a status appropriate to the 
context' would be better. Each context then needs to define what that is. 
Either that or 'the context's default status' and that should be in the 
definition of the context.

-- 
Alex Bligh







Re: [Qemu-devel] implementing architectural timers using QEMU timers

2017-01-09 Thread Alex Bligh

> On 9 Jan 2017, at 15:18, Max Filippov <jcmvb...@gmail.com> wrote:
> 
> Hello,
> 
> I'm trying to reimplement xtensa CCOUNT (cycle counter) and
> CCOMPARE (CCOUNT-based timer interrupts) using QEMU
> timers. That is CCOUNT value is derived from the
> QEMU_CLOCK_VIRTUAL clock and CCOMPARE interrupts are
> generated from the QEMU_CLOCK_VIRTUAL timer callbacks.
> The code is here:
>  https://github.com/OSLL/qemu-xtensa/commits/xtensa-ccount
> 
> I've got the following issues doing that:
> 
> - in non-icount mode I can often read CCOUNT and get a value
>  that is greater than programmed CCOMPARE value, which
>  means that QEMU timer must have been fired at that point, but
>  no sign of timer callback being called. That is timer callback
>  invocation lags behind due time.
> 
>  Is my understanding correct that there's no hard expectations
>  that firing of QEMU timers will be correctly sequenced with
>  readings of QEMU clock?

My understanding is that the intent of the API contract is that

* The timer will not fire earlier than the time requested (but
  can fire later)

* Timers on the same clock and context will fire in the order
  of their expiry time (with the ordering being undefined in
  the event the expiry times are equal).

Whether in practice any users make use of the second guarantee
above I don't know.
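
To illustrate my reading of that contract against QEMU's actual timer API (the deltas are arbitrary; this shows what is and is not promised, not documented behaviour):

```c
#include "qemu/timer.h"

/* Sketch: two timers on the same clock. As I read the contract, a()
 * runs no earlier than now+100 and b() no earlier than now+200, and
 * a() runs before b(); neither is promised to run promptly at its
 * deadline. */
static void a(void *opaque) { /* ... */ }
static void b(void *opaque) { /* ... */ }

static void arm_timers(void)
{
    int64_t now = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL);
    QEMUTimer *ta = timer_new_ns(QEMU_CLOCK_VIRTUAL, a, NULL);
    QEMUTimer *tb = timer_new_ns(QEMU_CLOCK_VIRTUAL, b, NULL);

    timer_mod(ta, now + 100);
    timer_mod(tb, now + 200);
}
```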

> 
> - I thought that could be improved in -icount mode, so I tried that.
>  It is better with -icount, but it's still not 100% accurate. That is
>  I was able to observe guest reading QEMU clock value that is
>  past QEMU timer deadline before that timer callback was
>  invoked.
> 
>  That sounds like a bug to me, is it?

Hmm... that would be a bug if it were guaranteed that the
guest reads are always perfectly in sync with the timer
itself. I don't know whether that is the case.

Even though I authored a large timer patch, I found the icount
stuff confusing.

> - when guest sets a timer and halts itself waiting for timer
>  interrupt with waiti opcode QEMU behaviour is very strange with
>  -icount: regardless of the programmed timeout QEMU waits for
>  about a second before it delivers interrupt, and, AFAICT,
>  interrupt delivery it is not caused by the corresponding CCOUNT
>  timer. I was able to track this issue down to the
>  qemu_clock_use_for_deadline function, i.e. always returning true
>  'fixes' that unwanted delay, but looking around the timer code
>  I've got the impression that that's not the correct fix.
> 
>  Any suggestions on how to fix that?

This could be someone using timerlistgroup_deadline_ns rather than
qemu_clock_deadline_ns_all.

The comment at the top of qemu_clock_deadline_ns_all says:

/* Calculate the soonest deadline across all timerlists attached
 * to the clock. This is used for the icount timeout so we
 * ignore whether or not the clock should be used in deadline
 * calculations.
 */

I'm wondering whether there is something up with that logic.

-- 
Alex Bligh







Re: [Qemu-devel] [PATCH] Further tidy-up on block status

2016-12-29 Thread Alex Bligh

> On 27 Dec 2016, at 14:09, Vladimir Sementsov-Ogievskiy 
> <vsement...@virtuozzo.com> wrote:
> 
> What was the reason for it? Why not to negotiate forced structured read 
> separately? Actually, this spec forces any server, which wants to implement 
> structured reply implement structured read too. But what if it don't want to? 
> If it only wants to implement BLOCK_STATUS?
> 
> So, what about changing it, to allow BLOCK_STATUS (or other future structured 
> replies) without structured read? Structured read is good only for sparse 
> formats, when BLOCK_STATUS is more global. I understand, that servers may 
> implement simple (and useless) one-chunk structured read, but I think that it 
> is better to fix the spec, to not provoke servers use such workaround.

In essence the current reply format is broken, because it does not provide a 
means of delivering an error mid reply AND expects fixed length replies. Block 
Status is the poster child for something that benefits from structured replies.

So, BLOCK_STATUS requires structured replies. We thus have the choice between:
* Allowing structured replies to be implemented on a per-command basis
* Mandating that if structured replies are implemented, they are implemented 
universally

The second option is far simpler, and is what the current structured reply 
extension sets out. It's also a good nudge to server implementers (me included) 
to implement structured replies.

Your assumption that structured replies are only good for sparse formats is 
incorrect. It's good for anything that wants to include sensible error 
handling. As a server author, error handling at the moment is a pain. Most of 
us ALREADY break up large read requests (to avoid malloc() of huge amounts of 
memory), so we already have a read loop. If we don't then the 'one chunk' 
method is perfectly acceptable. If I implemented structured replies at all, I'd 
be sharing the code between multiple users in any case.
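
For what it's worth, the existing 'read in pieces' loop maps onto structured replies almost directly. A sketch, where the send_* helpers, the error handling and the 1MiB cap are all hypothetical:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_CHUNK (1u << 20)  /* arbitrary per-chunk cap (1MiB) */

/* Hypothetical helpers a chunking server would already have: */
int backend_read(void *buf, uint64_t off, uint32_t len);
void send_data_chunk(uint64_t handle, uint64_t off, const void *buf,
                     uint32_t len, bool final);
void send_error_chunk(uint64_t handle, uint32_t err, bool final);

/* Sketch: the usual "read in pieces" loop maps directly onto
 * structured reply chunks, including an error part-way through. */
static void serve_read(uint64_t handle, uint64_t off, uint32_t len)
{
    static uint8_t buf[MAX_CHUNK];

    while (len) {
        uint32_t n = len < MAX_CHUNK ? len : MAX_CHUNK;

        if (backend_read(buf, off, n) < 0) {
            send_error_chunk(handle, EIO, true);  /* error mid-reply */
            return;
        }
        send_data_chunk(handle, off, buf, n, len == n);
        off += n;
        len -= n;
    }
}
```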

-- 
Alex Bligh







Re: [Qemu-devel] [Nbd] [PATCH] Further tidy-up on block status

2016-12-29 Thread Alex Bligh
Vladimir,

> On 26 Dec 2016, at 15:57, Vladimir Sementsov-Ogievskiy 
> <vsement...@virtuozzo.com> wrote:
> 
>> OK, so first of all, one of the changes I made earlier was that now
>> each of the commands carries a list of queries, the way you list
>> everything is not 'having a query that doesn't contain a namespace'
>> but rather doing a _LIST_ with no queries at all. But that's semantics
>> and orthogonal to the main point.
>> 
>> What I've proposed (and pushed - but feel free to alter it) is that
> 
> I think, an overview in "## Metadata querying" should be updated too.

Sorry, which bit needs updating?

The only bit in that more general section that mentions _LIST_ is:

> First, during negotiation, if the client wishes to query metadata
> during transmission, the client MUST select one or more metadata
> contexts with the NBD_OPT_SET_META_CONTEXT command. If needed, the
> client can use NBD_OPT_LIST_META_CONTEXT to list contexts that the
> server supports.

That still seems correct.

-- 
Alex Bligh







Re: [Qemu-devel] [Nbd] [PATCH] Further tidy-up on block status

2016-12-29 Thread Alex Bligh

> On 28 Dec 2016, at 00:18, Wouter Verhelst <w...@uter.be> wrote:
> 
> On Mon, Dec 26, 2016 at 05:52:54PM +0300, Vladimir Sementsov-Ogievskiy wrote:
>> Shouldn't we add some flags to REP_META_CONTEXT, for client to be insure, is
>> returned string a direct context name or some kind of wildcard? Just a flags
>> field, with one flag defined for now: NBD_REP_META_CONTEXT_LEAF and others
>> reserved.
> 
> I think it should be up to the metadata context namespace definition to
> define which syntax represents a direct context name and which
> represents a wildcard (if the latter are supported).
> 
> A client which doesn't know what a given metadata context implements
> can't reasonably ask for information from that context anyway (since
> then the client wouldn't know what to do with the returned information),
> so it doesn't help much to add a flag here.

I agree.

Vladimir: if this isn't clear from the text, please suggest a change.

-- 
Alex Bligh







Re: [Qemu-devel] [Nbd] [PATCH] Further tidy-up on block status

2016-12-17 Thread Alex Bligh

> On 17 Dec 2016, at 08:34, Wouter Verhelst <w...@uter.be> wrote:
> 
> I've therefore removed that restriction as well as the "255 bytes max"
> one that you added, since I don't think they make much sense. That
> doesn't mean I can't be convinced otherwise by good arguments, but
> they'd have to be very good ones ;-)

But we were at cross purposes. I was talking about the return from
queries, and you've adjusted namespaces (in toto). I've fixed up
your edit slightly, and brought the return for queries in line
with it. I hope this nit is now dead!

-- 
Alex Bligh







Re: [Qemu-devel] [Nbd] [PATCH] Further tidy-up on block status

2016-12-16 Thread Alex Bligh

> On 16 Dec 2016, at 15:52, Wouter Verhelst <w...@uter.be> wrote:
> 
> On Thu, Dec 15, 2016 at 05:34:48PM +0000, Alex Bligh wrote:
>> 
>>> On 15 Dec 2016, at 16:49, Wouter Verhelst <w...@uter.be> wrote:
>>> 
>>>> Because the namespaces and leaf-names are already restricted to
>>>> non-whitespace characters. I thought having tabs, line feeds,
>>>> returns, em-space, en-space etc. was not particularly useful.
>>>> I could be persuaded to relent re spaces.
>>> 
>>> I could imagine that the context might include as part of its name a
>>> user-defined bit. If we're going to disallow whitespace, then that would
>>> mean namespaces would have to do all sorts of escaping etc. I don't
>>> think that's a good idea.
>> 
>> So to be clear do you want to include all whitespace
>> everywhere? Or just to the right of the colon in queries?
> 
> Just in queries. I agree that in namespace names, it doesn't make much
> sense.

I've changed it so the returns from queries can now have spaces to
the right of the colon if they aren't returning complete context names.
Obviously if they are returning context names, by the definition of
leaf-names they can't have spaces in the leaf-name.

The queries themselves can consist of anything (currently),
save that they must begin with a namespace and a colon.
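
So the syntax check left for a server is roughly the following - my own summary of the rules as now written, with isprint()/isspace() as an ASCII approximation of 'printable non-whitespace UTF-8':

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Sketch: a query must begin with a non-empty namespace followed by
 * a colon; the namespace may not contain colons or whitespace. The
 * portion after the colon is left to the namespace to interpret and
 * may contain spaces. */
static bool query_syntax_ok(const char *q, size_t len)
{
    const char *colon = memchr(q, ':', len);

    if (colon == NULL || colon == q)
        return false;
    for (const char *p = q; p < colon; p++) {
        if (!isprint((unsigned char)*p) || isspace((unsigned char)*p))
            return false;
    }
    return true;
}
```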

-- 
Alex Bligh







Re: [Qemu-devel] [Nbd] [PATCH] Further tidy-up on block status

2016-12-15 Thread Alex Bligh

> On 15 Dec 2016, at 16:49, Wouter Verhelst <w...@uter.be> wrote:
> 
>> Because the namespaces and leaf-names are already restricted to
>> non-whitespace characters. I thought having tabs, line feeds,
>> returns, em-space, en-space etc. was not particularly useful.
>> I could be persuaded to relent re spaces.
> 
> I could imagine that the context might include as part of its name a
> user-defined bit. If we're going to disallow whitespace, then that would
> mean namespaces would have to do all sorts of escaping etc. I don't
> think that's a good idea.

So to be clear do you want to include all whitespace
everywhere? Or just to the right of the colon in queries?

-- 
Alex Bligh







Re: [Qemu-devel] [Nbd] [PATCH] Further tidy-up on block status

2016-12-15 Thread Alex Bligh
Wouter,

> This reads a bit awkward. I would do:
> 
> s/save that:/except as explained below/

Possibly a British English thing. Will fix.

>>If one or more queries are sent, then the server MUST return
>>those metadata contexts that are available to the client to
>>select on the given export with `NBD_OPT_SET_META_CONTEXT`,
>>and which match one or more of the queries given. The
>>support of wildcarding within the leaf-name portion of
>>the query string is dependent upon the namespace.
>> 
>>In either case, however, for any given namespace the
>>server MAY, instead of exhaustively listing every
>>matching context available to select (or every context
>>available to select where no query is given), send
>>sufficient context records back to allow a client with
>>knowledge of the namespace to select any context. Each
>>namespace returned MUST still satisfy the rules for
>>namespaces (i.e. they must begin with the relevant
>>namespace, followed by a colon, then printable non-whitespace
>>UTF-8 characters,
> 
> Why restrict to non-whitespace characters? (printable makes sense...)

Because the namespaces and leaf-names are already restricted to
non-whitespace characters. I thought having tabs, line feeds,
returns, em-space, en-space etc. was not particularly useful.
I could be persuaded to relent re spaces.

-- 
Alex Bligh







Re: [Qemu-devel] [Nbd] [PATCH] Further tidy-up on block status

2016-12-15 Thread Alex Bligh

> On 14 Dec 2016, at 20:18, Wouter Verhelst <w...@uter.be> wrote:
> 
>> +* the server MAY return a context consisting of a namespace and
>> +  a colon only (i.e. omitting the leaf-name) to indicate that
>> +  the namespace contains a large number of possible contexts
>> +  within that namespace (for instance a namespace `X-backup` with
>> +  contexts that indicate whether blocks were written after
>> +  a given date might accept queries of the form
>> +  `'X-backup:modifiedtime>[unixdate]'` where `[unixdate]` is an
>> +  arbitrary integer, and in this case it might simply
>> +  return `X-backup:`)
> 
> This is way too detailed, I think. It should just allow namespaces to
> define what _LIST_ may return, as long as the client is somehow able to
> distill (through knowledge of the spec as well as the information sent
> in reply to the _LIST_ command) all the metadata contexts it can
> possibly select.


OK, this part now reads:

The server MUST either reply with an error (for instance `EINVAL`
if the option is not supported), or reply with a list of
`NBD_REP_META_CONTEXT` replies followed by `NBD_REP_ACK`.

If zero queries are sent, then the server MUST return all
the metadata contexts that are available to the client to select
on the given export with `NBD_OPT_SET_META_CONTEXT`, save that:

If one or more queries are sent, then the server MUST return
those metadata contexts that are available to the client to
select on the given export with `NBD_OPT_SET_META_CONTEXT`,
and which match one or more of the queries given. The
support of wildcarding within the leaf-name portion of
the query string is dependent upon the namespace.

In either case, however, for any given namespace the
server MAY, instead of exhaustively listing every
matching context available to select (or every context
available to select where no query is given), send
sufficient context records back to allow a client with
knowledge of the namespace to select any context. Each
namespace returned MUST still satisfy the rules for
namespaces (i.e. they must begin with the relevant
namespace, followed by a colon, then printable non-whitespace
UTF-8 characters, with the entire string not exceeding
255 bytes). This may be helpful where a client can
construct algorithmic queries. For instance, a client might
reply simply with the namespace with no leaf-name (e.g.
'X-FooBar:') or with a range of values (e.g.
'X-ModifiedDate:20160310-20161214'). The semantics of
such a reply are a matter for the definition of the
namespace.

Hope that works.

-- 
Alex Bligh







Re: [Qemu-devel] [Nbd] [PATCH] Further tidy-up on block status

2016-12-14 Thread Alex Bligh
Wouter,

(Our mails crossed and I've actually pushed something, but no matter)

> On 14 Dec 2016, at 18:49, Wouter Verhelst <w...@uter.be> wrote:
> 
> What I was trying to say is that I think the result to _LIST_ with no
> queries should return all information the client needs to theoretically
> build the list of all possible contexts, even if that list may be
> so large as to be unfeasible for it to be built (e.g., in case of a
> cartesian product between all possible other contexts). I gave one
> example, but there may be more.
> 
> My point is that if the query includes a namespace, the result should
> not be defined by our spec. If the query does not include a namespace,
> the result should be "complete" by whatever definition, but not
> unreasonable (i.e., don't just write a cartesian product to a client).
> 
> This could allow an interactive client to present a user with a list of
> possible contexts before performing analysis on the block device, say.

OK, so first of all, one of the changes I made earlier was that now
each of the commands carries a list of queries, the way you list
everything is not 'having a query that doesn't contain a namespace'
but rather doing a _LIST_ with no queries at all. But that's semantics
and orthogonal to the main point.

What I've proposed (and pushed - but feel free to alter it) is that

1. on _LIST_, the server can return fewer contexts than are available
   if returning all of them would consume undue levels of resources.

2. on _LIST_ where the contexts are 'algorithmic', the server can
   return e.g. 'X-Backup:' rather than 'X-Backup:modified>[integer]' for
   every possible integer.

3. On _SET_ if too many contexts are requested, the server may return
   an error (I think we need this anyway).

That nearly does what you ask for, but I'm not sure how any query
could 'return all the information the client needs to build
the list of all possible contexts'. For instance, in my backup
example 'X-Backup:modified>[integer]' doesn't itself tell you
anything, as you don't know whether the integer is a unix
date time, in seconds after the epoch, milliseconds or whatever.
What you as a client surely want to know is 'does it support
the X-Backup: extension, because I've read the spec for that and
know that it has X-Backup:modified if so'. So I've suggested it
return 'X-Backup:' only in that case, in which case from that
(*and the spec*) you know how to build any query.

-- 
Alex Bligh







Re: [Qemu-devel] [Nbd] [PATCH] Further tidy-up on block status

2016-12-14 Thread Alex Bligh

> On 14 Dec 2016, at 18:18, Alex Bligh <a...@alex.org.uk> wrote:
> 
> Let me have a go at a change.

OK I've pushed a change. I reordered a few bits (to put the
base:allocation next to the stuff that talks about metadata
queries at the start), but the main change is below.

-- 
Alex Bligh



@@ -970,15 +1029,38 @@ of the newstyle negotiation.
- String, query to list a subset of the available metadata
  contexts.

-If zero queries are sent, then the server MUST return all
-the metadata contexts that are available to the client to select
-on the given export with `NBD_OPT_SET_META_CONTEXT`.
-
 For details on the query string, see under `NBD_OPT_SET_META_CONTEXT`.

 The server MUST either reply with an error (for instance `EINVAL`
 if the option is not supported), or reply with a list of
 `NBD_REP_META_CONTEXT` replies followed by `NBD_REP_ACK`.
+
+If zero queries are sent, then the server MUST return all
+the metadata contexts that are available to the client to select
+on the given export with `NBD_OPT_SET_META_CONTEXT`, save that:
+
+If one or more queries are sent, then the server MUST return
+those metadata contexts that are available to the client to
+select on the given export with `NBD_OPT_SET_META_CONTEXT`,
+and which match one or more of the queries given. The
+support of wildcarding within the leaf-name portion of
+the query string is dependent upon the namespace.
+
+In either case, however:
+
+* the server MAY return an incomplete list if returning a
+  complete list would require undue resources.
+
+* the server MAY return a context consisting of a namespace and
+  a colon only (i.e. omitting the leaf-name) to indicate that
+  the namespace contains a large number of possible contexts
+  within that namespace (for instance a namespace `X-backup` with
+  contexts that indicate whether blocks were written after
+  a given date might accept queries of the form
+  `'X-backup:modifiedtime>[unixdate]'` where `[unixdate]` is an
+  arbitrary integer, and in this case it might simply
+  return `X-backup:`)
+
 The metadata context ID in these replies is reserved and SHOULD be
 set to zero; clients MUST disregard it.

@@ -1009,7 +1091,9 @@ of the newstyle negotiation.

 If zero queries are sent, the server MUST select no metadata contexts.

-The server MUST reply with a number of `NBD_REP_META_CONTEXT`
+The server MAY return `NBD_REP_ERR_TOO_BIG` if a request
+seeks to select too many contexts. Otherwise
+the server MUST reply with a number of `NBD_REP_META_CONTEXT`
 replies, one for each selected metadata context, each with a unique
 metadata context ID, followed by `NBD_REP_ACK`. The metadata context
 ID is transient and may vary across calls to `NBD_OPT_SET_META_CONTEXT`;



Re: [Qemu-devel] [Nbd] [PATCH] Further tidy-up on block status

2016-12-14 Thread Alex Bligh

> On 14 Dec 2016, at 18:13, Wouter Verhelst <w...@uter.be> wrote:
> 
> Instead, I would suggest that the spec leave it up to the namespace spec
> to define what gets listed when. The namespace should still give some
> indication that a particular *type* of context is available, though;
> e.g., _LIST_ could return 'backup:snapshot-diff{*}' as well as a list of
> 'backup:snapshot{snapshot-X}' contexts, where the latter are all
> available snapshots, and the 'snapshot-diff' bit implies that the whole
> cartesian product of all possible diffs between snapshots is possible in
> a format of 'backup:snapshot-diff{snapshot-X:snapshot-Y}', or some such
> (you get the idea).
> 
> The spec currently says that a server SHOULD allow choosing exports by
> simply naming them, but doesn't make that a MUST; and it also says that
> the query format is defined by the namespace. If we simply clarify that
> the output of _LIST_ doesn't necessarily need to be a full list of all
> that's available (that it may be symbolic in the case of namespaces that
> allow things to be dynamically defined), we should be good.

Let me have a go at a change.

However, note that Vladimir was answering a slightly different question.
I was asking about listing ALL contexts (previously an empty query
string, now a zero length list of queries), not a 'backup:*' type
thing.

-- 
Alex Bligh







Re: [Qemu-devel] [PATCH] Further tidy-up on block status

2016-12-14 Thread Alex Bligh

> On 14 Dec 2016, at 17:55, Vladimir Sementsov-Ogievskiy 
> <vsement...@virtuozzo.com> wrote:
> 
>> Hmmm... Well in the '*' case, it's up to the namespace as to how it parses 
>> things, so '*' could be prohibited entirely or mean 'return the first 20 
>> matching' or similar. So that's less of a problem. And 'set all' doesn't 
>> exist (unlike 'list all').
>> 
>> I think in the LIST case we should allow the server to omit contexts under 
>> certain circumstances, and allow SET to return ETOOBIG.
>> 
> 
> We can't prohibit '*' as query string as implementation defined. And we 
> shouldn't (I think) define its meaning.

Those two statements appear to me to contradict each other. The current 
documentation *does* define the query string (to the right of the colon) as 
implementation defined. I'm thus not sure what you mean.

> Opposite, we can define, that query must not select more than 20 contexts, 
> but I'm not sure that this is good.

Each query can select from only one namespace in the current document (Wouter 
explained why). However, you can specify multiple queries.

I don't think we need to define a hard limit of a particular number.

> So, do you mean
> 
>  list('backup:modtime>*') -> empty list

The result of that would depend on however the 'backup' context was defined, 
meaning it had the options of:
* ETOOBIG
* Listing some subset of available contexts (arguably 'none' is a subset)

>  set('backup:modtime>*') -> ETOOBIG

Again, the result of that would depend on however the 'backup' context was 
defined, meaning it had the options of:
* ETOOBIG
* Listing some subset of available contexts (arguably 'none' is a subset)
... and it need not make the same choice as above, but I agree it would be 
logical for it to do so.

As any form of wildcarding within a query refers to one particular namespace 
(as a query by its nature specifies a single namespace), we don't have to worry 
about the way wildcarding works in the standard, as Wouter pointed out. We 
merely need to provide that ETOOBIG and listing a subset are acceptable 
responses.

What we do need to decide is what the response to list() (i.e. a list with a 
zero length list of queries) does. It's currently defined to return every 
context from every namespace. Options would include
* ETOOBIG
* Listing some subset of available contexts (arguably 'none' is a subset)
* Allowing abbreviation of algorithmically specified contexts (e.g. in this 
case just returning 'backup:' to represent all available backup contexts)

-- 
Alex Bligh







Re: [Qemu-devel] [Nbd] [PATCH] Further tidy-up on block status

2016-12-14 Thread Alex Bligh
Wouter,

> Right. I think we're getting close to a good spec now, for this thing.
> 
> One thing I've been thinking about that we might want to add:
> 
> There may be cases where a server, in performing the required calls to
> be able to handle a BLOCK_STATUS request, will end up with more
> information than the client asked; e.g., if the client asked information
> in the base:allocation context on an extent at offset X of length Y,
> then the server might conceivably do an lseek(SEEK_DATA) and/or
> lseek(SEEK_HOLE). The result of that call might be that the file offset
> will now point to a location Z, where Z > (X+Y).
> 
> Currently, our spec says "the sum of the *length* fields MUST not be
> greater than the overall *length* of request". This means that in
> essense, the server then has to throw away the information it has on the
> range Z - (X + Y). In case a client was interested in that information,
> that seems like a waste. I would therefore like to remove that
> requirement, by rephrasing that "sum of the *length* fields" thing into
> something along the following lines:
> 
>  In case the server returns N extents, the sum of the *length* fields
>  of the first N-1 extents MUST NOT be greater than the overall *length*
>  of the request. The final extent MAY exceed the length of the request
>  if the server has that information anyway as a side effect of looking
>  up the required information and wishes to share it.
> 
> This would then result in the fact that the "length" field in the
> BLOCK_STATUS command would be little more than a hint, since we're
> saying that a server can return more data than requested (if it's
> available anyway) and less data than requested (if it would be too
> resource-intensive to provide all the information). Not sure whether all
> that makes much sense anymore, but hey.

+1

> In addition, the combination of a server providing more information than
> requested with a "REQ_ONE" flag and a length field of zero could be an
> interesting way to enumerate a whole export, too. Essentially, we could
> define that as a client saying "I'm interested in what the size of the
> extent at offset X is, and what its properties are".

Also +1

-- 
Alex Bligh







Re: [Qemu-devel] [PATCH] Further tidy-up on block status

2016-12-14 Thread Alex Bligh

> On 14 Dec 2016, at 17:05, Vladimir Sementsov-Ogievskiy 
> <vsement...@virtuozzo.com> wrote:
>> 
>> However, this raises another question. Wouter deliberately made the
>> query format freeform so that you could e.g. set a context like:
>> 
>>backup:modtime>201612081034
>> 
>> which might (in theory) return a list of blocks which are newer than
>> the given timestamp. It would clearly be impossible to return all such
>> contexts. I wonder if we should carve out an exception here.
>> 
>> 
> 
> Interesting. Even query 'backup:modtime>*' would be a problem, not only empty 
> query list.
> 
> Actually, we do not need to 'list contexts', as it is about management, not 
> about data transfer. We only need a way to check, that particular query 
> selects all needed contexts and no others. Just to be sure that we are know, 
> what we will select by NBD_OPT_SET_META_CONTEXT with _same_ query.
> 
> So, I suggest just to say, that _LIST_ can return error if too much contexts 
> are selected. And same should be done for _SET_. And it should be documented 
> that this result of query (list or error) should be equal for these two 
> commands.

(two CCs that always bounce removed)

Hmmm... Well in the '*' case, it's up to the namespace as to how it parses 
things, so '*' could be prohibited entirely or mean 'return the first 20 
matching' or similar. So that's less of a problem. And 'set all' doesn't exist 
(unlike 'list all').

I think in the LIST case we should allow the server to omit contexts under 
certain circumstances, and allow SET to return ETOOBIG.

-- 
Alex Bligh







Re: [Qemu-devel] [PATCH] Further tidy-up on block status

2016-12-14 Thread Alex Bligh

> On 14 Dec 2016, at 16:58, Eric Blake <ebl...@redhat.com> wrote:
> 
> s/botht he/both the/

Thanks - fixed

--
Alex Bligh








Re: [Qemu-devel] [Nbd] [Qemu-block] [PATCH] doc: Propose NBD_FLAG_INIT_ZEROES extension

2016-12-14 Thread Alex Bligh

> ...  But your argument that 16 bits is at a premium may be worth
> considering an alternative representation for the same information.
> 
> Possibility 1: when we run out of our current 16 transmission flags,
> we'll need a way to extend the protocol to support more than that.  To
> do that, we'd probably want a new Handshake flag
> NBD_FLAG_LARGE_TRANSMISSION_FLAGS sent by the server, and which must be
> answered by the client sending a new Client flag
> NBD_FLAG_C_LARGE_TRANSMISSION_FLAGS, so that both sides know that all
> subsequent places in the protocol that send transmission flags will now
> send 64 bits instead of 16 bits (old-style negotiation cannot benefit
> from this, and is permanently stuck at 16 bits, but we discourage
> old-style negotiation).  We'd still want to prioritize which bits land
> in the first 16 positions, for maximum compatibility with older servers
> or clients (where the higher bit positions are used for data that is
> more suited for optimizations, rather than controlling correct usage) -
> thus NBD_FLAG_INIT_ZEROES would serve as an example to live in the
> extended positions.

Well, to me, this just reads like the transmission flags aren't
limited to 16 bits as you've demonstrated a workaround. Any client/server
needing to interpret more bits than the 16 just needs to also
understand extended transmission bit flags. I don't think that's
an unreasonable requirement, so I don't actually buy the prioritisation
argument.

So this seems to me like a good argument we should proceed how
you originally suggested and Wouter should be less worried about
your flag-burning activities :-)

> Possibility 2: Leverage the extension-info branch: Add a new
> NBD_INFO_INIT_ZERO information datum.  NBD_BLOCK_INFO already exists to

I think you mean NBD_INFO_BLOCK_SIZE?

> let server and client exchange as much per-export information as they
> want during handshake without burning any new transmission flags, and
> with a specification that gracefully ignores unknown requests from the
> client and unknown advertisements from the server.  There's the drawback
> that the server can't advertise the known-zero-init optimization to
> clients that don't use NBD_OPT_GO, but that's not too bad (especially if
> qemu is the first client with code to actually make use of the
> optimization, since I am already trying to get qemu to be one of the
> first adopters of NBD_OPT_GO).

I think you are suggesting adding NBD_INFO_SOMETHINGELSE, which could
be extended transmission flags. Fair enough.

Possibility #3: we modify NBD_INFO_EXPORT (which is still experimental)
*now* to carry an extended, 64-bit transmission flags field. This is where
transmission flags SHOULD go, and just as we'll run out of them
in NBD_OPT_EXPORT_NAME we'll also run out of the same 16 flags in
NBD_OPT_GO. So it seems a good idea to free us from that limitation
now. And all clients (one hopes) will use NBD_OPT_GO eventually.

This has the minor disadvantage of breaking clients that rely on
the current state of the experimental extension, but that's why
it's experimental. Newer clients can talk to older servers if they
are so inclined by checking the length field on the option (which
will extend from 12 to 16).

> We'll probably have to eventually use something along the lines of
> possibility 1 for more transmission flags for some other reason down the
> road (since we have lately been adding one flag per new command/command
> group as we add extensions), but I agree that it is a bit heavy for now.
> So it sounds like I should rework the proposal along the lines of
> possibility 2, which also implies that we should get extension-info
> ready for promotion to current.  And that means that my current proposal
> to the extension-write-zeroes branch is probably not ideal, but it
> reopens the question of whether extension-write-zeroes as it currently
> stands is ready for promotion to stable now that qemu 2.8 implements it.

My view is:
#0 (your original proposal) is actually fine.
#1 seems too heavy
#2 also seems pretty heavyweight - adding a whole new info command for one
   bit
#3 is pretty simple, but carries the disadvantage that you won't be able
   to provide a reference implementation without also putting NBD_OPT_GO
   support into the reference implementation. Oh hang on, perhaps that's
   an advantage :-)

So I'd either go with #0 or #3.

--
Alex Bligh








Re: [Qemu-devel] [PATCH] Further tidy-up on block status

2016-12-14 Thread Alex Bligh
Vladimir,

>> +non-zero number of metadata contexts during negotiation. Servers SHOULD
>> +reply to clients sending `NBD_CMD_BLOCK_STATUS without
> 
> backquote

Fixed

>> +If zero queries are sent, then the server MUST return all
>> +the metadata contexts it knows about.
> 
> I think that 'all .. it knows about' is too much. What about 'return all 
> available ..'? Anyway 'all ... it knows about' actually equals to 'all ... it 
> wants'. There may be some private, or unrelated contexts, for example..

This was not my wording, but I've changed it anyway to:

If zero queries are sent, then the server MUST return all
the metadata contexts that are available to the client to select
on the given export with `NBD_OPT_SET_META_CONTEXT`.

I think if they are available to select, we should list them. Thanks
also for reminding me to document why I put the export name into the
_LIST_ data (as it is for _SET_).

However, this raises another question. Wouter deliberately made the
query format freeform so that you could e.g. set a context like:

   backup:modtime>201612081034

which might (in theory) return a list of blocks which are newer than
the given timestamp. It would clearly be impossible to return all such
contexts. I wonder if we should carve out an exception here.

-- 
Alex Bligh







[Qemu-devel] [PATCH] Further tidy-up on block status

2016-12-14 Thread Alex Bligh
(NB: I've already applied this and pushed it)

* Change NBD_OPT_LIST_METADATA etc. to explicitly send a list of queries
  and add a count of queries so we can extend the command later (rather than
  rely on the length of option)

* For NBD_OPT_LIST_METADATA make absence of any query (as opposed to zero
  length query) list all contexts, as absence of any query is now simple.

* Move definition of namespaces in the document to somewhere more appropriate.

* Various other minor changes as discussed on the mailing list

Signed-off-by: Alex Bligh <a...@alex.org.uk>
---
 doc/proto.md | 179 +--
 1 file changed, 112 insertions(+), 67 deletions(-)

diff --git a/doc/proto.md b/doc/proto.md
index e03f434..6955c76 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -683,15 +683,20 @@ by other users, so has been moved out of the 
"experimental" section.
 
 ## Metadata querying
 
-With the availability of sparse storage formats, it is often needed to
-query the status of a particular range and read only those blocks of
-data that are actually present on the block device.
+It is often helpful for the client to be able to query the status
+of a range of blocks. The nature of the status that can be
+queried is in part implementation dependent. For instance,
+the status might represent:
 
-Some storage formats and operations over such formats express a
-concept of data dirtiness. Whether the operation is block device
-mirroring, incremental block device backup or any other operation with
-a concept of data dirtiness, they all share a need to provide a list
-of ranges that this particular operation treats as dirty.
+* in a sparse storage format, whether the relevant blocks are
+  actually present on the backing device for the export; or
+
+* whether the relevant blocks are 'dirty'; some storage formats
+  and operations over such formats express a concept of data dirtiness.
+  Whether the operation is block device mirroring, incremental block
+  device backup or any other operation with a concept of data dirtiness,
+  they all share a need to provide a list of ranges that this
+  particular operation treats as dirty.
 
 To provide such classes of information, the NBD protocol has a generic
 framework for querying metadata; however, its use must first be
@@ -699,9 +704,11 @@ negotiated, and one or more metadata contexts must be 
selected.
 
 The procedure works as follows:
 
-- First, during negotiation, the client MUST select one or more metadata
+- First, during negotiation, if the client wishes to query metadata
+  during transmission, the client MUST select one or more metadata
   contexts with the `NBD_OPT_SET_META_CONTEXT` command. If needed, the client
-  can use `NBD_OPT_LIST_META_CONTEXT` to list contexts.
+  can use `NBD_OPT_LIST_META_CONTEXT` to list contexts that the server
+  supports.
 - During transmission, a client can then indicate interest in metadata
   for a given region by way of the `NBD_CMD_BLOCK_STATUS` command, where
   *offset* and *length* indicate the area of interest. The server MUST
@@ -715,13 +722,37 @@ The procedure works as follows:
   of type `NBD_REPLY_TYPE_BLOCK_STATUS`.
 
 A client MUST NOT use `NBD_CMD_BLOCK_STATUS` unless it selected a
-nonzero number of metadata contexts during negotiation. Servers SHOULD
-reply to clients doing so anyway with `EINVAL`.
+non-zero number of metadata contexts during negotiation. Servers SHOULD
+reply to clients sending `NBD_CMD_BLOCK_STATUS without
+selecting metadata contexts with `EINVAL`.
 
-The reply to the `NBD_CMD_BLOCK_STATUS` request MUST be sent by a
+The reply to the `NBD_CMD_BLOCK_STATUS` request MUST be sent as a
 structured reply; this implies that in order to use metadata querying,
 structured replies MUST be negotiated first.
 
+Metadata contexts are identified by their names. The name MUST
+consist of a namespace, followed by a colon, followed by a leaf-name.
+The namespace and the leaf-name must each consist entirely of
+printable non-whitespace UTF-8 characters other than colons,
+and be non-empty. The entire name (namespace, colon and leaf-name)
+MUST NOT exceed 255 bytes (and therefore botht he namespace and
+leaf-name are guaranteed to be smaller than 255 bytes).
+
+Namespaces MUST consist of one of the following:
+- `base`, for metadata contexts defined by this document;
+- `nbd-server`, for metadata contexts defined by the
+   implementation that accompanies this document (none
+   currently);
+- `x-*`, where `*` can be replaced by any random string not
+   containing colons, for local experiments. This SHOULD NOT be
+   used by metadata contexts that are expected to be widely used.
+- A third-party namespace from the list below.
+
+Third-party implementations can register additional namespaces by simple
+request to the mailing-list. The following additional third-party namespaces
+are currently registered:
+* (none)
+
 This standard defines exactly one 

Re: [Qemu-devel] [Nbd] [PATCH v5] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-12-13 Thread Alex Bligh

> On 13 Dec 2016, at 16:06, Wouter Verhelst <w...@uter.be> wrote:
> 
>> 
>> "If a server supports the `base:allocation` metadata context, then writing
>> to an extent which has `NBD_STATE_HOLE` clear MUST NOT fail with ENOSPC
>> unless for reasons specified in the definition of another context."
> 
> That is essentially what I mean, yes. The point though, is that I also
> think a client should assume the worst.
> 
> (additionally, there may be reasons for ENOSPC that the server isn't
> aware of; so even if NBD_STATE_HOLE is clear, it should still not be an
> error for the server to return ENOSPC)

All of this suggests 'SHOULD NOT' would be more appropriate than
'MUST NOT'.

-- 
Alex Bligh







Re: [Qemu-devel] [Nbd] [PATCH v5] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-12-13 Thread Alex Bligh
> 
> (number of people telling me that. Yes, I know :-)

Sorry, behind on mail :-)

>>> +# Metadata contexts
>>> +
>>> +The `base:allocation` metadata context is the basic "allocated at all"
>>> +metadata context. If an extent is marked with `NBD_STATE_HOLE` at that
>>> +context, this means that the given extent is not allocated in the
>>> +backend storage, and that writing to the extent MAY result in the ENOSPC
>>> +error. This supports sparse file semantics on the server side. If a
>>> +server has only one metadata context (the default), then writing to an
>>> +extent which has `NBD_STATE_HOLE` clear MUST NOT fail with ENOSPC.
>> 
>> Again I'm still confused by this. I *think* you mean "If a server
>> supports the `base:allocation` metadata context, then writing
>> to an extent which has `NBD_STATE_HOLE` clear MUST NOT fail with ENOSPC.`
>> 
>> I say that because as currently phrased:
>> 
>> * If a server has one metadata context only, but it is not
>>  `base:allocation`, then you implying something about writing
>>  to an extent with a state that won't even exist.
> 
> Yes, that's wrong indeed. We should replace the "the default" bit by
> "the base:allocation context".
> 
>> * If a server has `base:allocation` AND another metadata context
>>  (for instance `qemu:dirty`) then the rule you set out will not
>>  apply.
> 
> Yes, and this is intentional, as I've explained before. The other
> context's semantics should clarify whether that rule still applies or
> not. Implementations that do not know of the other context should
> however assume that it doesn't.

I think I must be being very stupid or am at least at cross-purposes.

Let's say the second context is (e.g.) 'colour:records_information_about_blue'
(i.e. nothing at all to do with space or holes). Are you saying that the
presence of that context might affect the interpretation of
NBD_STATE_HOLE? Or just other contexts within the 'base:' space?
Or that another context (e.g. perhaps compression) could be an independent
cause of ENOSPC? If so should it be the responsibility of the other context
to document that it does not affect that?

How about:

"If a server supports the `base:allocation` metadata context, then writing
to an extent which has `NBD_STATE_HOLE` clear MUST NOT fail with ENOSPC
unless for reasons specified in the definition of another context."

>>> +For the `base:allocation` context, the remainder of the flags field is
>>> +reserved. Servers SHOULD set it to all-zero;
>> 
>> Surely if we want to reserve them for extension, we need "Servers
>> MUST set it to all-zero"
> 
> No, SHOULD, otherwise a future extension which adds meaning to those
> bits will suddenly become incompatible with this spec. Think about it
> ;-)

I did! If there is a future extension, it will change the spec to
incorporate those bits, so they won't be included within 'the
remainder' any more.

> (feel free to update the branch with those suggestions I've not NAK'd,
> as I think they make sense...)

OK. May take a look later.

-- 
Alex Bligh







Re: [Qemu-devel] [Nbd] [PATCH v5] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-12-13 Thread Alex Bligh
Stefan,

> What about querying "base:" if the NBD spec adds more standard metadata
> contexts in the future?  "base:" does not "uniquely identify a server
> implementation" but it should still be possible to query it so that
> additional contexts can be added to this spec in the future.
> 
> Is the query matching algorithm defined anywhere?  I guess it is
> strncmp(query, context, strlen(query)) but I'm not sure from this text.
> 
> Another common syntax would use an asterisk wildcard ('*') so that it's
> possible to differentiate between full matches and (wildcard) partial
> matches.

I agree. I have suggested a simple string-based left match, which
also fixes ...

>> +- third-party implementations can register additional
>> +  namespaces by simple request to the mailinglist.
> 
> Does this mean it is possible to activate multiple contexts but only if
> their namespace is identical?  That seems like an arbitrary limitation.
> 
> In other words, the spec suggests you can activate nbd-server:a and
> nbd-server:b but you cannot activate base:a and nbd-server:a.

... this, provided that NBD_SET_META_CONTEXT actually takes a list of
query strings.

> I'd prefer a programming model where the contexts don't need to be set
> for the lifetime of the connection.  Instead the client passes a bitmap
> (64-bits?) of contexts along with each BLOCK_STATUS command.  That way
> the client can select contexts on a per-command basis.  It's unlikely
> that more than 64 contexts need to be available at once but I admit it's
> an arbitrary limitation...
> 
> I guess you've considered this model and decided it's better to
> negotiate upfront, it's wasteful to enable multiple contexts and then
> discard the response data on the client side because only a subset is
> needed for this command invocation.

The issue is that there's nowhere within the current command structure
to put it.

I think the easy way to do this would be to have context IDs as a payload
to the command, like they are with NBD_CMD_WRITE, but the payload
is currently always defined to be of the length specified within the
length section.

The question really is whether we should fix this silly protocol limitation.
I don't think it's as bad as the structured reply fix, and could conceivably
go in there.

--
Alex Bligh








Re: [Qemu-devel] [Nbd] [PATCH v5] doc: Add NBD_CMD_BLOCK_STATUS extension (part 2)

2016-12-13 Thread Alex Bligh

> On 12 Dec 2016, at 20:43, Wouter Verhelst <w...@uter.be> wrote:
> 
> +- `NBD_OPT_SET_META_CONTEXT` (11)
> +
> +Change the set of active metadata contexts. Issuing this command
> +replaces all previously-set metadata contexts; clients must ensure
> +that all metadata contexts they're interested in are selected with
> +the final query that they sent.
> +
> +Data:
> +- 32 bits, length of query
> +- String, query to select metadata contexts. The syntax of this
> +  query is implementation-defined, except that it MUST start with a
> +  namespace. This namespace may be one of the following:
> +- `base:`, for metadata contexts defined by this document;
> +- `nbd-server:`, for metadata contexts defined by the
> +  implementation that accompanies this document (none
> +  currently);
> +- `x-*:`, where `*` can be replaced by any random string not
> +  containing colons, for local experiments. This SHOULD NOT be
> +   used by metadata contexts that are expected to be widely used.
> +- third-party implementations can register additional
> +  namespaces by simple request to the mailinglist.

Surely also we need to specify multiple queries?
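
As a sketch of what multiple queries could look like on the wire, assuming
the option body simply repeats the (32-bit length, string) pair from the
draft above (one possible encoding, not a settled format):

    import struct

    def pack_meta_context_queries(queries):
        # Hypothetical NBD_OPT_SET_META_CONTEXT payload: each query sent
        # as a 32-bit big-endian length followed by its UTF-8 bytes.
        payload = b""
        for q in queries:
            data = q.encode("utf-8")
            payload += struct.pack(">I", len(data)) + data
        return payload

    # e.g. select the base context plus a local experiment:
    pack_meta_context_queries(["base:allocation", "x-example:dirty"])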


--
Alex Bligh








Re: [Qemu-devel] [Nbd] [PATCH v5] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-12-13 Thread Alex Bligh
> ...resource-intensive for the server.
> +
> All error chunk types have bit 15 set, and begin with the same
> *error*, *message length*, and optional *message* fields as
> `NBD_REPLY_TYPE_ERROR`.  If non-zero, *message length* indicates
> @@ -1085,7 +1280,7 @@ remaining structured fields at the end.
>   were sent earlier in the structured reply, the server SHOULD NOT
>   send multiple distinct offsets that lie within the bounds of a
>   single content chunk.  Valid as a reply to `NBD_CMD_READ`,
> -  `NBD_CMD_WRITE`, and `NBD_CMD_TRIM`.
> +  `NBD_CMD_WRITE`, `NBD_CMD_TRIM`, and `NBD_CMD_BLOCK_STATUS`.
> 
>   The payload is structured as:
> 
> @@ -1259,6 +1454,44 @@ The following request types exist:
> 
> Defined by the experimental `WRITE_ZEROES` 
> [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-write-zeroes/doc/proto.md).
> 
> +* `NBD_CMD_BLOCK_STATUS` (7)
> +
> +A block status query request. Length and offset define the range of
> +interest. Clients MUST NOT use this request unless metadata
> +contexts have been negotiated,

I think you mean "have been selected" but see my comment re perhaps
making no contexts being selected meaning all contexts are selected.

> which in turn requires the client to
> +first negotiate structured replies. For a successful return, the
> +server MUST use a structured reply, containing at least one chunk of
> +type `NBD_REPLY_TYPE_BLOCK_STATUS`, where the status field of each
> +descriptor is determined by the flags field as defined by the
> +metadata context.
> +
> +The list of block status descriptors within the
> +`NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
> +of the file starting from specified *offset*, and the sum of the
> +*length* fields of each descriptor MUST not be greater than the
> +overall *length* of the request. This means that the server MAY
> +return less data than required. However the server MUST return at
> +least one status descriptor.  The server SHOULD use different
> +*status* values between consecutive descriptors, and SHOULD use
> +descriptor lengths that are an integer multiple of 512 bytes where
> +possible (the first and last descriptor of an unaligned query being
> +the most obvious places for an exception). The status flags are
> +intentionally defined so that a server MAY always safely report a
> +status of 0 for any block, although the server SHOULD return
> +additional status values when they can be easily detected.
> +
> +If an error occurs, the server SHOULD set the appropriate error
> +code in the error field of an error chunk. However, if the error
> +does not involve invalid usage (such as a request beyond the bounds
> +of the file), a server MAY reply with a single block status
> +descriptor with *length* matching the requested length, and *status*
> +of 0 rather than reporting the error.
> +
> +A client MAY initiate a hard disconnect if it detects that the
> +server has sent an invalid chunk. The server SHOULD return `EINVAL`
> +if it receives a `NBD_CMD_BLOCK_STATUS` request including one or
> +more sectors beyond the size of the device.
> +
> * Other requests
> 
> Some third-party implementations may require additional protocol messages
> which are not described in this document.
> 
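
To make the descriptor rules in the `NBD_CMD_BLOCK_STATUS` text above
concrete, here is a minimal client-side parse, assuming each status
descriptor is a 32-bit length followed by a 32-bit status-flags field (a
sketch of the draft semantics, not a final wire format):

    import struct

    def parse_block_status(payload, req_offset, req_length):
        # Descriptors assumed to be (32-bit length, 32-bit status) pairs.
        if len(payload) % 8:
            raise ValueError("truncated descriptor list")
        extents, covered = [], 0
        for i in range(0, len(payload), 8):
            length, status = struct.unpack(">II", payload[i:i + 8])
            extents.append((req_offset + covered, length, status))
            covered += length
        # At least one descriptor, and the sum of lengths MUST NOT exceed
        # the requested length; the server MAY cover less, in which case
        # the client simply re-queries the remainder.
        if not extents or covered > req_length:
            raise ValueError("server violated the descriptor rules")
        return extents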

--
Alex Bligh








Re: [Qemu-devel] [Nbd] [Qemu-block] [PATCH] doc: Propose NBD_FLAG_INIT_ZEROES extension

2016-12-13 Thread Alex Bligh

> On 13 Dec 2016, at 12:18, Wouter Verhelst <w...@uter.be> wrote:
> 
> I'm not opposed either, but I do agree with you that we shouldn't add
> such a feature if it doesn't end up getting used. Especially so if it
> burns a flag in the (16-bit) "transmission flags" field, where space is
> at a premium.

I did suggest a few non-Qemu uses (see below). I think it might be
an idea if the reference implementation supported it before
merging (which per below should be trivial).

-- 
Alex Bligh


> Begin forwarded message:
> 
> From: Alex Bligh <a...@alex.org.uk>
> Subject: Re: [Nbd] [Qemu-block] [PATCH] doc: Propose NBD_FLAG_INIT_ZEROES 
> extension
> Date: 6 December 2016 at 10:29:41 United Kingdom Time
> To: Kevin Wolf <kw...@redhat.com>
> Cc: Alex Bligh <a...@alex.org.uk>, Eric Blake <ebl...@redhat.com>, 
> "nbd-gene...@lists.sourceforge.net" <nbd-gene...@lists.sourceforge.net>, 
> xieying...@huawei.com, su...@huawei.com, qemu block <qemu-bl...@nongnu.org>, 
> "qemu-devel@nongnu.org" <qemu-devel@nongnu.org>, Paolo Bonzini 
> <pbonz...@redhat.com>
> 
> 
>> On 6 Dec 2016, at 09:25, Kevin Wolf <kw...@redhat.com> wrote:
>> 
>> Am 06.12.2016 um 00:42 hat Eric Blake geschrieben:
>>> While not directly related to NBD_CMD_WRITE_ZEROES, the qemu
>>> team discovered that it is useful if a server can advertise
>>> whether an export is in a known-all-zeroes state at the time
>>> the client connects.
>> 
>> Does a server usually have the information to set this flag, other than
>> querying the block status of all blocks at startup? 
> 
> The server may have other ways of knowing this, for instance
> that it has just created the file (*), or that it stat'd the file
> before opening it (not unlikely) and noticed it had 0 allocated
> size. The latter I suspect would be trivial to implement in nbd-server
> 
> (*) = e.g. I had one application where nbd use the export path
> to signify it wanted to open a temporary file, the path consisting
> of a UUID and an encoded length. If the file was not present already
> it created it with ftruncate(). That could trivially have used this.
> 
>> If so, the client could just query this by itself.
> 
> Well there's no currently mainlined extension to do that, but yes
> it could. On the other hand I see no issue passing complete
> zero status back to the client if it's so obvious from a stat().
> 
> -- 
> Alex Bligh
> 
> 
> 
> 





Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-12-08 Thread Alex Bligh

> On 8 Dec 2016, at 15:59, Eric Blake <ebl...@redhat.com> wrote:
> 
> We should use similar wording to whatever we already say about what a
> client would see when reading data cleared by NBD_CMD_TRIM.  After all,
> the status of STATE_HOLE set and STATE_ZERO clear is what you logically
> get when TRIM cannot guarantee reads-as-zero.

Yes. It was actually exactly that discussion I was trying to remember.

--
Alex Bligh








Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-12-08 Thread Alex Bligh
> (...to make it a SHOULD)

Don't see why it should even be a 'SHOULD' to be honest. An nbd
server cooked up for a specific purpose, or with a backend that
can't provide that (or where there is never a hole) shouldn't
be criticised.

>> I think the last sentence is probably meant to read something like:
>> 
>> If a server supports the "BASE:allocation" metadata context, then
>> writing to an extent which has `NBD_STATE_HOLE` clear MUST NOT fail
>> with ENOSPC.
> 
> No, it can't.
> 
> Other metadata contexts may change the semantics to the extent that if
> the block is allocated at the BASE:allocation context, it would still
> need space on another plane. In that case, writing to an area which
> is not STATE_HOLE might still fail.

I understand the point (though I'm slightly dubious about it given
the discussion we had with the WRITE_ZEROES extension where I thought
we agreed that the purpose of actual zeroes being written to disk
was to avoid ENOSPC errors). However, if that is the case, I'm not
sure what you're trying to say in the original text.

>> Why 512 bytes as opposed to 'minimum block size' (or is it because
>> that is also an experimental extension)?
> 
> Yes, and this extension does not depend on that one. Hence why it is
> also a SHOULD and not a MUST.

OK. As a separate discussion I think we should talk about promoting
WRITE_ZEROES and INFO. The former has a reference implementation,
I think Eric did a qemu implementation, and I did a gonbdserver
implementation. The latter I believe lacks a reference implementation.

>> +  The status flags are
>> +intentionally defined so that a server MAY always safely report a
>> +status of 0 for any block, although the server SHOULD return
>> +additional status values when they can be easily detected.
>> +
>> +If an error occurs, the server SHOULD set the appropriate error
>> +code in the error field of either a simple reply or an error
>> +chunk.
>> 
>> We should probably point out that you can return an error half way
>> through - that being the point of structured replies.
> 
> Right.
> 
> (also, I think it MUST NOT send a simple reply, but I'll fix that up
> separately)

A simple error reply would normally be permitted.

>> +  However, if the error does not involve invalid usage (such
>> +as a request beyond the bounds of the file), a server MAY reply
>> +with a single block status descriptor with *length* matching the
>> +requested length, and *status* of 0 rather than reporting the
>> +error.
>> 
>> Wht's the point of this? This appears to say that a server can lie
>> and return everything as not a hole, and not zero! Surely we're
>> already covered from the DoS angle?
> 
> I'm not sure, I believe that wording came from the original patch on
> which I based my work.

Sure, I think it's left over detritus.

>> +Upon receiving an `NBD_CMD_BLOCK_STATUS` command, the server MUST
>> +return the status of the device,
>> 
>> status of the metadata context
> 
> No, status of the device. A metadata context *describes* status, it
> *isn't* one.
> 
> Perhaps "status of the device as per the given metadata context", but
> hey.

It's actually 'device' I'm arguing with. Server side we refer to them
as 'exports'. Often they aren't files at all. In gonbdserver it
might be a Ceph connection elsewhere.

'return the status of the relevant portion of the export (as per the
given metadata context'?

>> +A client SHOULD NOT read from an area that has both `NBD_STATE_HOLE`
>> +set and `NBD_STATE_ZERO` clear.
>> 
>> That makes no sense, as normal data has both these bits clear! This
>> also implies that to comply with this SHOULD, a client needs to
>> request block status before any read, which is ridiculous. This
>> should be dropped.
> 
> No, it should not, although it may need rewording. It clarifies that
> having STATE_HOLE set (i.e., there's no valid data in the given range)
> and STATE_ZERO clear (i.e., we don't assert that it would read as
> all-zeroes) is not an invalid thing for a server to set. The spec here
> clarifies what a client should do with that information if it gets it
> (i.e., "don't read it, it doesn't contain anything interesting").

That's fair enough until the last bit in brackets. Rather than saying
a client SHOULD NOT read it, it should simply say that a read on
such areas will succeed but the data read is undefined (and may
not be stable).

> My WIP patch moves this out from the (older) "BLOCK_STATUS extension"
> section and into the main body of the spec. It also makes a few changes
> in wording as per what Vladimir suggested, and I was working on an
> NBD_OPT_LIST_META_CONTEXT rather than an NBD_OPT_META_CONTEXT
> negotiation option, with the idea that I'd add an OPT_ADD_META_CONTEXT
> and an OPT_DEL_META_CONTEXT later. Your idea of using a SET has merit
> though, so I'll update it to that effect.
> 
> It already removed the two bits that BASE:allocation doesn't use, and
> makes a few other changes as well. I haven't had the time to finish it
> and send it out for review though, but I'll definitely include your
> comments now.

Thanks.

-- 
Alex Bligh







Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-12-08 Thread Alex Bligh

> On 8 Dec 2016, at 06:58, Vladimir Sementsov-Ogievskiy 
> <vsement...@virtuozzo.com> wrote:
> 
> An idea: let's not use uppercase. Why shout the namespace? 'base' and 'x-' 
> would be better I think. BASE and X- will provoke all user-defined namespaces 
> to be in uppercase too, and a lot of uppercase will come into the code =(

I agree

-- 
Alex Bligh







Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-12-07 Thread Alex Bligh

> On 2 Dec 2016, at 18:45, Alex Bligh <a...@alex.org.uk> wrote:
> 
> Thanks. That makes sense - or enough sense for me to carry on commenting!


I finally had some time to go through this extension in detail. Rather
than comment on all the individual patches, I squashed them into a single
commit, did a 'git format-patch' on it, and commented on that.

diff --git a/doc/proto.md b/doc/proto.md
index c443494..9c0981f 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -871,6 +869,50 @@ of the newstyle negotiation.

Defined by the experimental `INFO` 
[extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-info/doc/proto.md).

+- `NBD_OPT_META_CONTEXT` (10)
+
+Return a list of `NBD_REP_META_CONTEXT` replies, one per context,
+followed by an `NBD_REP_ACK`. If a server replies to such a request
+with no error message, clients

"*the* server" / "*the* cient"

Perhaps only an 'NBD_REP_ERR_UNSUP' error should prevent the
client querying the server.

+MAY send NBD_CMD_BLOCK_STATUS
+commands during the transmission phase.

Add: "If a server replies to such a request with NBD_REP_ERR_UNSUP,
the client MUST NOT send NBD_CMD_BLOCK_STATUS commands during the
transmission phase."

+
+If the query string is syntactically invalid, the server SHOULD send
+`NBD_REP_ERR_INVALID`.

'MUST send' else it implies sending nothing is permissible.

+If the query string is syntactically valid
+but finds no metadata contexts, the server MUST send a single
+reply of type `NBD_REP_ACK`.
+
+This option MUST NOT be requested unless structured replies have

Active voice better:

"The client MUST NOT send this option unless" ...

+been negotiated first. If a client attempts to do so, a server
+SHOULD send `NBD_REP_ERR_INVALID`.
+
+Data:
+- 32 bits, type
+- String, query to select a subset of the available metadata
+  contexts. If this is not specified (i.e., length is 4 and no
+  command is sent), then the server MUST send all the metadata
+  contexts it knows about. If specified, this query string MUST
+  start with a name that uniquely identifies a server
+  implementation; e.g., the reference implementation that
+  accompanies this document would support query strings starting
+  with 'nbd-server:'

Why not just define the format of a metadata-context string to be
of the form `<namespace>:<name>` (perhaps this definition goes
elsewhere), and here just say the returned list is a left-match
of all the available metadata-contexts, i.e. all those metadata
contexts whose names either start with or consist entirely of the
specified string. If an empty string is specified, all metadata
contexts are returned.
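
(The left-match just described is a one-liner; a sketch assuming plain
string-prefix semantics:)

    def matching_contexts(available, query):
        # Empty query selects everything; otherwise plain prefix match.
        return [name for name in available if name.startswith(query)]

    matching_contexts(["base:allocation", "nbd-server:foo"], "base:")
    # -> ["base:allocation"]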

+
+The type may be one of:
+- `NBD_META_LIST_CONTEXT` (1): the list of metadata contexts
+  selected by the query string is returned to the client without
+  changing any state (i.e., this does not add metadata contexts
+  for further usage).

Somewhere it should say this list is returned by sending
zero or more NBD_REP_META_CONTEXT records followed by a NBD_REP_ACK.

+- `NBD_META_ADD_CONTEXT` (2): the list of metadata contexts
+  selected by the query string is added to the list of existing
+  metadata contexts.
+- `NBD_META_DEL_CONTEXT` (3): the list of metadata contexts
+  selected by the query string is removed from the list of used
+  metadata contexts. Servers SHOULD NOT reuse existing metadata
+  context IDs.
+
+The syntax of the query string is not specified, except that
+implementations MUST support adding and removing individual metadata
+contexts by simply listing their names.

This seems slightly over complicated. Rather than have a list held
by the server of active metadata contexts, we could simply have
two NBD_OPT_ commands, say NBD_OPT_LIST_META_CONTEXTS and
NBD_OPT_SET_META_CONTEXTS (which simply sets a list). Then no
need for 'type', and _ADD_ and _DEL_.

 Option reply types

These values are used in the "reply type" field, sent by the server
@@ -882,7 +924,7 @@ during option haggling in the fixed newstyle negotiation.
information is available, or when sending data related to the option
(in the case of `NBD_OPT_LIST`) has finished. No data.

...

+- `NBD_REP_META_CONTEXT` (4)
+
+A description of a metadata context. Data:
+
+- 32 bits, NBD metadata context ID.
+- String, name of the metadata context. This is not required to be
+  a human-readable string, but it MUST be valid UTF-8 data.

I would suggest putting in a length of the string before the string,
which will allow us to expand this later to add fields if necessary.
This seems to be what we are doing elsewhere.
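
(A sketch of what the length prefix buys: a client can take the fields it
knows and skip anything appended later. This layout is the suggestion
above, not what the draft currently says:)

    import struct

    def parse_rep_meta_context(data):
        # Suggested layout: 32-bit context ID, 32-bit name length, name,
        # then any future extension fields, which old clients can ignore.
        ctx_id, name_len = struct.unpack(">II", data[:8])
        name = data[8:8 + name_len].decode("utf-8")
        return ctx_id, name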

+
+This specification declares one metadata context. It is called
+"BASE:allocation" and contains the basic "exists at all" context.
+
There are a number of error reply types, all of which are denoted by
having bit 31 set.

Re: [Qemu-devel] [Nbd] [PATCH] doc: Propose NBD_FLAG_INIT_ZEROES extension

2016-12-06 Thread Alex Bligh

> On 6 Dec 2016, at 08:46, Alex Bligh <a...@alex.org.uk> wrote:
> 
> I would support this.
> 
> In fact the patch is sufficiently simple I think I'd merge this
> into extension-write-zeroes then merge that into master.

Hence:

Reviewed-By: Alex Bligh <a...@alex.org.uk>

-- 
Alex Bligh







Re: [Qemu-devel] [Nbd] [Qemu-block] [PATCH] doc: Propose NBD_FLAG_INIT_ZEROES extension

2016-12-06 Thread Alex Bligh

> On 6 Dec 2016, at 09:25, Kevin Wolf <kw...@redhat.com> wrote:
> 
> Am 06.12.2016 um 00:42 hat Eric Blake geschrieben:
>> While not directly related to NBD_CMD_WRITE_ZEROES, the qemu
>> team discovered that it is useful if a server can advertise
>> whether an export is in a known-all-zeroes state at the time
>> the client connects.
> 
> Does a server usually have the information to set this flag, other than
> querying the block status of all blocks at startup? 

The server may have other ways of knowing this, for instance
that it has just created the file (*), or that it stat'd the file
before opening it (not unlikely) and noticed it had 0 allocated
size. The latter I suspect would be trivial to implement in nbd-server

(*) = e.g. I had one application where nbd use the export path
to signify it wanted to open a temporary file, the path consisting
of a UUID and an encoded length. If the file was not present already
it created it with ftruncate(). That could trivially have used this.
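
(The stat-based check really is trivial; a sketch, noting that st_blocks
behaviour is filesystem-dependent, so this is a best-effort hint only:)

    import os

    def looks_all_zero(path):
        # A freshly ftruncate()d sparse file has no blocks allocated, so
        # every read returns zeroes; st_blocks counts 512-byte blocks.
        return os.stat(path).st_blocks == 0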

> If so, the client could just query this by itself.

Well there's no currently mainlined extension to do that, but yes
it could. On the other hand I see no issue passing complete
zero status back to the client if it's so obvious from a stat().

-- 
Alex Bligh







Re: [Qemu-devel] [Nbd] [PATCH] doc: Propose NBD_FLAG_INIT_ZEROES extension

2016-12-06 Thread Alex Bligh

> On 5 Dec 2016, at 23:42, Eric Blake <ebl...@redhat.com> wrote:
> 
> While not directly related to NBD_CMD_WRITE_ZEROES, the qemu
> team discovered that it is useful if a server can advertise
> whether an export is in a known-all-zeroes state at the time
> the client connects.

I think this is good to go, and ...

> Signed-off-by: Eric Blake <ebl...@redhat.com>
> ---
> doc/proto.md | 5 +
> 1 file changed, 5 insertions(+)
> 
> This replaces the following qemu patch attempt:
> https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg00357.html
> which tried to add NBD_CMD_HAS_ZERO_INIT with poor semantics. The
> semantics in this proposal should be much better.
> 
> Patch is to the merge of the master branch and the
> extension-write-zeroes branch.  By the way, qemu 2.8 is due
> to be released "real soon now", and implements NBD_CMD_WRITE_ZEROES,
> so maybe it is time to consider promoting the extension-write-zeroes
> branch into master.

I would support this.

In fact the patch is sufficiently simple I think I'd merge this
into extension-write-zeroes then merge that into master.

Wouter?

Alex

> diff --git a/doc/proto.md b/doc/proto.md
> index afe71fc..7e4ec7f 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -697,6 +697,11 @@ The field has the following format:
>   the export.
> - bit 9, `NBD_FLAG_SEND_BLOCK_STATUS`: defined by the experimental
>   `BLOCK_STATUS` 
> [extension](https://github.com/NetworkBlockDevice/nbd/blob/extension-blockstatus/doc/proto.md).
> +- bit 10, `NBD_FLAG_INIT_ZEROES`: Indicates that the server guarantees
> +  that at the time transmission phase begins, all offsets within the
> +  export read as zero bytes.  Clients MAY use this information to
> +  avoid writing to sections of the export that should still read as
> +  zero after the client is done writing.
> 
> Clients SHOULD ignore unknown flags.
> 
> -- 
> 2.9.3
> 
> 

-- 
Alex Bligh







Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-12-03 Thread Alex Bligh

> On 2 Dec 2016, at 20:39, John Snow <js...@redhat.com> wrote:
> 
> OK. We do certainly support multiple bitmaps being active at a time in
> QEMU, but I had personally always envisioned that you'd associate them
> one-at-a-time when starting the NBD export of a particular device.
> 
> I don't have a use case in my head where two distinct bitmaps being
> exposed simultaneously offer any particular benefit, but maybe there is
> something. I'm sure there is.

The obvious case is an allocation bitmap, and a dirty bitmap.

It's possible one might want more than one dirty bitmap at once.
Perhaps two sorts of backup, or perhaps live migrate of storage
backed by NBD, or perhaps inspecting the *state* of live migration
via NBD and a bitmap, or perhaps determining what extents in
a QCOW image are in that image itself (as opposed to the image
on which it is based).

I tried to pick some QEMU-like ones, but I am sure there are
examples that would work outside of QEMU.

-- 
Alex Bligh







Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-12-02 Thread Alex Bligh
> ...a "dirty" bit to the specification, we allow users to clear those bits.
> 
> Then, whether the user was trying to do (A) or (B) or the unspeakable
> amalgamation of both things, it's up to the user to clear the bits
> desired and QEMU can do the simple task of simply always merging the
> bitmap fork upon the conclusion of the NBD fleecing exercise.
> 
> Maybe this would allow the dirty bit to have a bit more concrete meaning
> for the NBD spec: "The bit stays dirty until the user clears it, and is
> set when the matching block/extent/etc is written to."
> 
> With an exception that external management may cause the bits to clear.
> (I.e., someone fiddles with the backing store in a way opaque to NBD,
> e.g. someone clears the bitmap directly through QEMU instead of via NBD.)

There is currently one possible "I've done with the entire bitmap"
signal, which is closing the connection. This has two obvious
problems. Firstly if used, it discards the entire bitmap (not bits).
Secondly, it makes recovery from a broken TCP session difficult
(as either you treat a dirty close as meaning the bitmap needs
to hang around, in which case you have a garbage collection issue,
or you treat it as needing to drop the bitmap, in which case you
can't recover).

I think in your plan the block status doesn't change once the bitmap
is forked. In that case, adding some command (optional) to change
the status of the bitmap (or simply to set a given extent to status X)
would be reasonable. Of course whether it's supported could be dependent
on the bitmap.

> Having missed most of the discussion on v1/v2, is it a given that we
> want in-band identification of bitmaps?
> 
> I guess this might depend very heavily on the nature of the definition
> of the "dirty bit" in the NBD spec.

I don't think it's a given. I think Wouter & I came up with it at
the same time as a way to abstract the bitmap/extent concept and
remove the need to specify a dirty bit at all (well, that's my excuse
anyway).

> Anyway, I hope I am being useful and just not more confounding. It seems
> to me that we're having difficulty conveying precisely what it is we're
> trying to accomplish, so I hope that I am making a good effort in
> elaborating on our goals/requirements.

Yes absolutely. I think part of the challenge is that you are quite
reasonably coming at it from the point of view of qemu's particular
need, and I'm coming at it from 'what should the nbd protocol look
like in general' position, having done lots of work on the protocol
docs (though I'm an occasional qemu contributor). So there's necessarily
a gap of approach to be bridged.

I'm overdue on a review of Wouter's latest patch (partly because I need
to re-diff it against the version with no NBD_CMD_BLOCK_STATUS in),
but I think it's a bridge worth building.

-- 
Alex Bligh







Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-29 Thread Alex Bligh
Vladimir,

>>> 4. Q: Should not get_{allocated,dirty} be separate commands?
>>>   cons: Two commands with almost same semantic and similar means?
>>>   pros: However here is a good point of separating clearly defined and 
>>> native
>>> for block devices GET_BLOCK_STATUS from user-driven and actually
>>> undefined data, called 'dirtyness'.
>> I'm suggesting one generic 'read bitmap' command like you.
> 
> To support get_block_status in this general read_bitmap, we will need to 
> define something like 'multibitmap', which allows several bits per chunk, as 
> allocation data has two: zero and allocated.

I think you are saying that for arbitrary 'bitmap' there might be more than one 
state. For instance, one might (in an allocation 'bitmap') have a hole, a 
non-hole-zero, or a non-hole-non-zero.

In the spec I'd suggest, for one 'bitmap', we represent the output as extents. 
Each extent has a status. For the bitmap to be useful, at least two status need 
to be possible, but the above would have three. This could be internally 
implemented by the server as (a) a bitmap (with two bits per entry), (b) two 
bitmaps (possibly with different granularity), (c) something else (e.g. reading 
file extents, then if the data is allocated manually comparing it against zero).

I should have put 'bitmap' in quotes in what I wrote because returning extents 
(as you suggested) is a good idea, and there need not be an actual bitmap.
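
(For instance, the three states above fall out of two flag bits per extent;
a sketch, with flag values chosen purely for illustration:)

    HOLE = 1 << 0   # extent not allocated in backend storage
    ZERO = 1 << 1   # extent reads as all zeroes

    # The three states, expressed as (length, status) extents:
    extents = [
        (65536, HOLE | ZERO),  # a hole
        (65536, ZERO),         # allocated, but known to read as zeroes
        (65536, 0),            # allocated, ordinary data
    ]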

>>> 5. Number of status descriptors, sent by server, should be restricted
>>>   variants:
>>>   1: just allow server to restrict this as it wants (which was done in v3)
>>>   2: (not excluding 1). Client specifies somehow the maximum for number
>>>  of descriptors.
>>>  2.1: add command flag, which will request only one descriptor
>>>   (otherwise, no restrictions from the client)
>>>  2.2: again, introduce extended nbd requests, and add field to
>>>   specify this maximum
>> I think some form of extended request is the way to go, but out of
>> interest, what's the issue with as many descriptors being sent as it
>> takes to encode the reply? The client can just consume the remainder
>> (without buffering) and reissue the request at a later point for
>> the areas it discarded.
> 
> the issue is: too many descriptors possible. So, (1) solves it. (2) is 
> optional, just to simplify/optimize client side.

I think I'd prefer the server to return what it was asked for, and the client 
to deal with it. So either the client should be able to specify a maximum 
number of extents (and if we are extending the command structure, that's 
possible) or we deal with the client consuming and retrying unwanted extents. 
The reason for this is that it's unlikely the server can know a priori the 
number of extents which is the appropriate maximum for the client.

>>> +The list of block status descriptors within the
>>> +`NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
>>> +of the file starting from specified *offset*, and the sum of the
>>> +*length* fields of each descriptor MUST not be greater than the
>>> +overall *length* of the request. This means that the server MAY
>>> +return less data than required. However the server MUST return at
>>> +least one status descriptor
>> I'm not sure I understand why that's useful. What should the client
>> infer from the server refusing to provide information? We don't
>> permit short reads etc.
> 
> if the bitmap is 010101010101 we will have too many descriptors. For example, 
> 16tb disk, 64k granularity -> 2G of descriptors payload.

Yep. And the cost of consuming and retrying is quite high. One option would be 
for the client to realise this is a possibility, and not request the entire 
extent map for a 16TB disk, as it might be very large! Even if the client 
worked at e.g. a 64MB level (where they'd get a maximum of 1024 extents per 
reply), this isn't going to noticeably increase the round trip timing. One 
issue here is that to determine a 'reasonable' size, the client needs to know 
the minimum length of any extent.
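
(As a check on that 2G figure, assuming 8-byte descriptors: 16 TiB at 64 KiB
granularity is 2^44 / 2^16 = 2^28 alternating extents, and at 8 bytes per
(length, status) descriptor that is 2^31 bytes, i.e. 2 GiB in a single reply.)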

I think the answer is probably a 'maximum number of extents' in the request 
packet.

Of course with statuses in extent, the final extent could be represented as 'I 
don't know, break this bit into a separate request' status.

-- 
Alex Bligh







Re: [Qemu-devel] [Nbd] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-29 Thread Alex Bligh

> On 29 Nov 2016, at 10:50, Wouter Verhelst <w...@uter.be> wrote:
> 
> +- `NBD_OPT_ALLOC_CONTEXT` (10)
> +
> +Return a list of `NBD_REP_ALLOC_CONTEXT` replies, one per context,
> +followed by an `NBD_REP_ACK`. If a server replies to such a request
> +with no error message, clients MAY send NBD_CMD_BLOCK_STATUS
> +commands during the transmission phase.

I haven't read this in detail but this seems to me to be similar to
the idea I just posted (sorry - kept getting interrupted whilst writing
the email) re multiple bitmaps?

But the name 'ALLOC_CONTEXT' is a bit weird. Why not call 'metadata
bitmaps' or 'metadata extents' or something. Metadata seems right as
it's data about data.

> +# Allocation contexts
> +
> +Allocation context 0 is the basic "exists at all" allocation context. If
> +an extent is not allocated at allocation context 0, it MUST NOT be
> +listed as allocated at another allocation context. This supports sparse
> +file semantics on the server side. If a server has only one allocation
> +context (the default), then writing to an extent which is allocated in
> +that allocation context 0 MUST NOT fail with ENOSPC.
> +
> +For all other cases, this specification requires no specific semantics
> +of allocation contexts. Implementations could support allocation
> +contexts with semantics like the following:
> +
> +- Incremental snapshots; if a block is allocated in one allocation
> +  context, that implies that it is also allocated in the next level up.
> +- Various bits of data about the backend of the storage; e.g., if the
> +  storage is written on a RAID array, an allocation context could
> +  return information about the redundancy level of a given extent
> +- If the backend implements a write-through cache of some sort, or
> +  synchronises with other servers, an allocation context could state
> +  that an extent is "allocated" once it has reached permanent storage
> +  and/or is synchronized with other servers.
> +
> +The only requirement of an allocation context is that it MUST be
> +representable with the flags as defined for `NBD_CMD_BLOCK_STATUS`.
> +
> +Likewise, the syntax of query strings is not specified by this document.
> +
> +Server implementations SHOULD document their syntax for query strings
> +and semantics for resulting allocation contexts in a document like this
> +one.
> +

But this seems strange. Whether something is 'allocated' seems orthogonal
to me to whether it needs backing up. Even the fact it's been zeroed
(now a hole) might need backing up).

So don't we need multiple independent lists of extents? Of course a server
might *implement* them under the hood with separate bitmaps or one big
bitmap, or no bitmap at all (for instance using file extents on POSIX).



-- 
Alex Bligh







Re: [Qemu-devel] [PATCH v3] doc: Add NBD_CMD_BLOCK_STATUS extension

2016-11-29 Thread Alex Bligh
> ...length and exploit this
> feature for BLOCK_STATUS command, specifying bitmap identifier.
> pros: - looks like a true way
> cons: - we have to create additional extension
>   - possible we have to create a map,
> { <=> }
>  3: external tool should select which bitmap to export. So, in case of 
> Qemu
> it should be something like qmp command block-export-dirty-bitmap.
> pros: - simple
>   - we can extend it to behave like (2) later
> cons: - additional qmp command to implement (possibly, the lesser 
> evil)
> note: Hmm, external tool can choose between allocated/dirty data 
> too,
>   so, we can remove 'NBD_FLAG_STATUS_DIRTY' flag at all.

Yes, this is all pretty horrible. I suspect we want to do something like (2),
and permit extra data across (in my proposal, it would only be one byte to 
select
the index). I suppose one could ask for a list of bitmaps.

> 4. Q: Should not get_{allocated,dirty} be separate commands?
>   cons: Two commands with almost same semantic and similar means?
>   pros: However here is a good point of separating clearly defined and native
> for block devices GET_BLOCK_STATUS from user-driven and actually
> undefined data, called 'dirtyness'.

I'm suggesting one generic 'read bitmap' command like you.

> 5. Number of status descriptors, sent by server, should be restricted
>   variants:
>   1: just allow server to restrict this as it wants (which was done in v3)
>   2: (not excluding 1). Client specifies somehow the maximum for number
>  of descriptors.
>  2.1: add command flag, which will request only one descriptor
>   (otherwise, no restrictions from the client)
>  2.2: again, introduce extended nbd requests, and add field to
>   specify this maximum

I think some form of extended request is the way to go, but out of
interest, what's the issue with as many descriptors being sent as it
takes to encode the reply? The client can just consume the remainder
(without buffering) and reissue the request at a later point for
the areas it discarded.

> 
> 6. A: What to do with unspecified flags (in request/reply)?
>   I think the normal variant is to make them reserved. (Server should
>   return EINVAL if found unknown bits, client should consider a reply
>   with unknown bits as an error)

Yeah.

> 
> +
> +* `NBD_CMD_BLOCK_STATUS`
> +
> +A block status query request. Length and offset define the range
> +of interest. Clients SHOULD NOT use this request unless the server

MUST NOT is what we say elsewhere I believe.

> +set `NBD_CMD_SEND_BLOCK_STATUS` in the transmission flags, which
> +in turn requires the client to first negotiate structured replies.
> +For a successful return, the server MUST use a structured reply,
> +containing at most one chunk of type `NBD_REPLY_TYPE_BLOCK_STATUS`.

Nit: are you saying that non-structured error replies are permissible?
You're always/often going to get a non-structured  (simple) error reply
if the server doesn't support the command, but I think it would be fair to say 
the
server MUST use a structured reply to NBD_CMD_SEND_BLOCK_STATUS if
it supports the command. This is effectively what we say re NBD_CMD_READ.

> +
> +The list of block status descriptors within the
> +`NBD_REPLY_TYPE_BLOCK_STATUS` chunk represent consecutive portions
> +of the file starting from specified *offset*, and the sum of the
> +*length* fields of each descriptor MUST not be greater than the
> +overall *length* of the request. This means that the server MAY
> +return less data than required. However the server MUST return at
> +least one status descriptor

I'm not sure I understand why that's useful. What should the client
infer from the server refusing to provide information? We don't
permit short reads etc.

> .  The server SHOULD use different
> +*status* values between consecutive descriptors, and SHOULD use
> +descriptor lengths that are an integer multiple of 512 bytes where
> +possible (the first and last descriptor of an unaligned query being
> +the most obvious places for an exception).

Surely better would be an integer multiple of the minimum block
size. Being able to offer bitmap support at finer granularity than
the absolute minimum block size helps no one, and if it were possible
to support a 256 byte block size (I think some floppy disks had that)
I see no reason not to support that as a granularity.

-- 
Alex Bligh







Re: [Qemu-devel] [Nbd] [PATCH] proto: add 'shift' extension.

2016-09-27 Thread Alex Bligh

> On 27 Sep 2016, at 00:41, Wouter Verhelst <w...@uter.be> wrote:
> 
> Thoughts?

Honestly? This whole thing seems like complication for little gain. One command 
per 2GB wiped doesn't seem like a bad thing.

-- 
Alex Bligh







Re: [Qemu-devel] [PATCH] proto: add 'shift' extension.

2016-09-26 Thread Alex Bligh

> On 26 Sep 2016, at 14:54, Paolo Bonzini <pbonz...@redhat.com> wrote:
> 
> 
> Considering that NBD supports multiple outstanding requests, is it a big
> deal to require one request per terabyte of storage?

+1

Also I don't think we particularly want to make clearing the entire
disk easy to do by mistake!

This whole 'clear the whole disk in one command' seems like a
dire case of premature optimisation.

-- 
Alex Bligh







Re: [Qemu-devel] [PATCH] proto: add 'shift' extension.

2016-09-26 Thread Alex Bligh

> On 26 Sep 2016, at 13:46, Vladimir Sementsov-Ogievskiy 
> <vsement...@virtuozzo.com> wrote:
> 
>  Option reply types
> 
> These values are used in the "reply type" field, sent by the server
> @@ -872,7 +883,13 @@ valid may depend on negotiation during the handshake 
> phase.
>   
> [extension](https://github.com/yoe/nbd/blob/extension-write-zeroes/doc/proto.md).
> - bit 2, `NBD_CMD_FLAG_DF`; defined by the experimental `STRUCTURED_REPLY`
>   
> [extension](https://github.com/yoe/nbd/blob/extension-structured-reply/doc/proto.md).
> -
> +- bit 3, `NBD_CMD_FLAG_SHIFT`; This flag is valid for `NBD_CMD_TRIM` and
> +  `NBD_CMD_WRITE_ZEROES`. If this flag is set the server shifts request
> +  *length* and *offset* left by N bits, where N is defined by `NBD_OPT_SHIFT`
> +  option or is assumed to be 16 bits by default if `NBD_OPT_SHIFT` option is
> +  not specified. If after shift `(offset + length)` exceeds disk size, length
> +  should be reduced to `(<disk size> - offset)`. However, `(offset + length)`
> +  must not exceed disk size by more than `(1 << N) - 1`.

Is there a good reason the shift size can't be either the minimum block size
or otherwise negotiated using NBD_OPT_INFO?
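
(A worked example of the quoted semantics, taking the default N of 16;
the field values are illustrative only:)

    N = 16                                  # default shift per the draft
    offset_field, length_field = 32, 16384  # values as sent on the wire
    effective_offset = offset_field << N    # 32 * 64 KiB = 2 MiB
    effective_length = length_field << N    # 16384 * 64 KiB = 1 GiB
    # A 32-bit length shifted by 16 bits can describe up to
    # (2**32 - 1) << 16 bytes, i.e. just under 256 TiB, per request.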

-- 
Alex Bligh







Re: [Qemu-devel] write_zeroes/trim on the whole disk

2016-09-24 Thread Alex Bligh

> On 24 Sep 2016, at 18:47, Vladimir Sementsov-Ogievskiy 
> <vsement...@virtuozzo.com> wrote:
> 
> I just wanted to say, that if we want a possibility of clearing the whole 
> disk in one request for qcow2 we have to take 512 as granularity for such 
> requests (with X = 9). An this is too small. 1tb will be the upper bound for 
> the request.

Sure. But I do not see the value in optimising these huge commands to run as 
single requests. If you want to do that, do it properly and have a 
negotiation-phase flag that supports 64 bit request lengths.

> Full backup, for example:
> 
> 1. target can do fast write_zeroes: clear the whole disk (great if we can do 
> it in one request, without splitting, etc), then backup all data except zero 
> or unallocated (save a lot of time on this skipping).
> 2. target can not do fast write_zeroes: just backup all data. We need not 
> clear the disk, as we will not save time by this.
> 
> So here, we need not splitting as a general. Just clear all or not clearing 
> at all.

As I said, within the current protocol you cannot tell whether a target 
supports 'fast write zeroes', and indeed the support may be partial - for 
instance with a QCOW2 backend, a write that is not cluster aligned would likely 
only partially satisfy the command by deallocating bytes. There is no current 
flag for 'supports fast write zeroes' and (given the foregoing) it isn't 
evident to me exactly what it would mean.

It seems however you could support your use case by simply iterating through 
the backup disk, using NBD_CMD_WRITE for the areas that are allocated and 
non-zero, and using NBD_CMD_WRITE_ZEROES for the areas that are not allocated 
or zeroed. This technique would not require a protocol change (beyond the 
existing NBD_CMD_WRITE_ZEROES extension), works irrespective of whether the 
target supports write zeroes or not, works irrespective of difference in 
cluster allocation size between source and target, is far simpler, and has the 
added advantage of making the existing zeroes-but-not-holes area into holes 
(that is optional if you can tell the difference between zeroes and holes on 
the source media). It also works on a single pass. Yes, you need to split 
requests up, but you need to split requests up ANYWAY to cope with 
NBD_CMD_WRITE's 2^32-1 length limit (I strongly advise you not to use more than 
2^31). And in any case, you probably want to parallelise reads and writes and 
have more than one write in flight in any case, all of which suggests you are 
going to be breaking up requests anyway.
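
(A sketch of that single-pass loop; read_extent_map, nbd_write and
nbd_write_zeroes are hypothetical helpers standing in for whatever the
client actually provides:)

    MAX_REQ = 1 << 31  # stay at 2^31 per request, as advised above

    def mirror(src, dst, disk_size):
        # One pass over the source's extent map: real writes for data,
        # WRITE_ZEROES (holes allowed) for zero or unallocated ranges.
        for offset, length, is_zero in read_extent_map(src, 0, disk_size):
            while length > 0:
                chunk = min(length, MAX_REQ)
                if is_zero:
                    nbd_write_zeroes(dst, offset, chunk)  # no NO_HOLE flag
                else:
                    nbd_write(dst, offset, src.read(offset, chunk))
                offset += chunk
                length -= chunk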

-- 
Alex Bligh







Re: [Qemu-devel] write_zeroes/trim on the whole disk

2016-09-24 Thread Alex Bligh

> On 24 Sep 2016, at 18:13, Vladimir Sementsov-Ogievskiy 
> <vsement...@virtuozzo.com> wrote:
> 
> On 24.09.2016 19:49, Alex Bligh wrote:
>>> On 24 Sep 2016, at 17:42, Vladimir Sementsov-Ogievskiy 
>>> <vsement...@virtuozzo.com> wrote:
>>> 
>>> On 24.09.2016 19:31, Alex Bligh wrote:
>>>>> On 24 Sep 2016, at 13:06, Vladimir Sementsov-Ogievskiy 
>>>>> <vsement...@virtuozzo.com> wrote:
>>>>> 
>>>>> Note: if disk size is not aligned to X we will have to send request 
>>>>> larger than the disk size to clear the whole disk.
>>>> If you look at the block size extension, the size of the disk must be an 
>>>> exact multiple of the minimum block size. So that would work.
> 
> This means that this extension could not be used with any qcow2 disk, as 
> qcow2 may have size not aligned to its cluster size.
> 
> # qemu-img create -f qcow2 mega 1K
> Formatting 'mega', fmt=qcow2 size=1024 encryption=off cluster_size=65536 
> lazy_refcounts=off refcount_bits=16
> # qemu-img info mega
> image: mega
> file format: qcow2
> virtual size: 1.0K (1024 bytes)
> disk size: 196K
> cluster_size: 65536
> Format specific information:
>compat: 1.1
>lazy refcounts: false
>refcount bits: 16
>corrupt: false
> 
> And there is no such restriction in documentation. Or we have to consider 
> sector-size (512b) as block size for qcow2, which is too small for our needs.

If by "this extension" you mean the INFO extension (which reports block sizes) 
that's incorrect.

An nbd server using a QCOW2 file as the backend would report the sector size as 
the minimum block size. It might report the cluster size or the sector size as 
the preferred block size, or anything in between.

QCOW2 cluster size essentially determines the allocation unit. NBD is not 
bothered as to the underlying allocation unit. It does not (currently) support 
the concept of making holes visible to the client. If you use 
NBD_CMD_WRITE_ZEROES you get zeroes, which might or might not be implemented as 
one or more holes or 'real' zeroes (save if you specify NBD_CMD_FLAG_NO_HOLE in 
which case you are guaranteed to get 'real' zeroes). If you use NBD_CMD_TRIM 
then the area trimmed might or might not be written with one or more holes. 
There is (currently) no way to detect the presence of holes separately from 
zeroes (though a bitmap extension was discussed).

>>> But there is no guarantee that disk_size/block_size < INT_MAX..
>> I think you mean 2^32-1, but yes there is no guarantee of that. In that case 
>> you would need to break the call up into multiple calls.
>> 
>> However, being able to break the call up into multiple calls seems pretty 
>> sensible given that NBD_CMD_WRITE_ZEROES may take a large amount of
>> time, and a REALLY long time if the server doesn't support trim.
>> 
>>> May be, additional option, specifying the shift would be better. With 
>>> convention that if offset+length exceeds disk size, length should be 
>>> recalculated as disk_size-offset.
>> I don't think we should do that. We already have clear semantics that 
>> prevent operations beyond the end of the disk. Again, just break the command 
>> up into multiple commands. No great hardship.
>> 
> 
> I agree that requests larger than disk size are ugly.. But splitting request 
> brings me again to idea of having separate command or flag for clearing the 
> whole disk without that dance. Server may report availability of this/flag 
> command only if target driver supports fast write_zeroes (qcow2 in our case).

Why? In the general case you need to break up requests anyway (particularly 
with the INFO extension where there is a maximum command size), and issuing a 
command over a TCP connection that might take hours or days to complete with no 
hint of progress, and no TCP traffic to keep NAT etc. alive, sounds like bad 
practice. The overhead is tiny.

I would be against this change.

-- 
Alex Bligh







Re: [Qemu-devel] write_zeroes/trim on the whole disk

2016-09-24 Thread Alex Bligh

> On 24 Sep 2016, at 17:52, Alex Bligh <a...@alex.org.uk> wrote:
> 
> In *your* use-case holes may be desirable. However in the general case, you 
> cannot assume a server supports holes. Optional support for holes isn't even 
> in the mainline spec yet (AFAIR).

You should also be aware that the minimum granularity for writes (currently 1 
byte in the spec, but the blocksize extension allows for larger) may be smaller 
than the minimum granularity for holes, which may depend on the underlying 
filing system.

A hole is in essence merely an optimised way of storing zeroes. And the trim 
operation says 'I'm not worried about this data any more - so zero it if you 
like'.

-- 
Alex Bligh







Re: [Qemu-devel] write_zeroes/trim on the whole disk

2016-09-24 Thread Alex Bligh

> On 24 Sep 2016, at 17:48, Vladimir Sementsov-Ogievskiy 
> <vsement...@virtuozzo.com> wrote:
> 
>>> Use NBD_CMD_WRITE_ZEROES without NBD_CMD_FLAG_NO_HOLE and you can pretty 
>>> much assume that a server that supports holes will write holes. A server 
>>> that does not support holes will write zeroes. If you don't care whether 
>>> the resultant data is zero, just use NBD_CMD_TRIM. But as you do care (see 
>>> above) you must be prepared for a 'thick' write of zeroes on servers that 
>>> don't support it.
>>> 
>> 
>> No, holes are critical. Concreate case: incremental backup to delta file. If 
>> we write zeroes instead of holes, we will lose underlying data (from 
>> previous incremental).
>> 
> hmm, no, sorry, that is not needed.

In *your* use-case holes may be desirable. However in the general case, you 
cannot assume a server supports holes. Optional support for holes isn't even in 
the mainline spec yet (AFAIR).

It's up to you if you choose not to connect to servers that don't support 
NBD_CMD_WRITE_ZEROES or NBD_CMD_TRIM, but even if you take that strategy, you 
cannot guarantee that the server will not implement them by ignoring 
NBD_CMD_TRIM (this is perfectly acceptable as NBD_CMD_TRIM is advisory only) 
and making NBD_CMD_WRITE_ZEROES do a long write of zeroes.

IE there is no way to detect whether the server actually supports holes.

-- 
Alex Bligh







Re: [Qemu-devel] write_zeroes/trim on the whole disk

2016-09-24 Thread Alex Bligh

> On 24 Sep 2016, at 17:42, Vladimir Sementsov-Ogievskiy 
> <vsement...@virtuozzo.com> wrote:
> 
> On 24.09.2016 19:31, Alex Bligh wrote:
>>> On 24 Sep 2016, at 13:06, Vladimir Sementsov-Ogievskiy 
>>> <vsement...@virtuozzo.com> wrote:
>>> 
>>> Note: if disk size is not aligned to X we will have to send request larger 
>>> than the disk size to clear the whole disk.
>> If you look at the block size extension, the size of the disk must be an 
>> exact multiple of the minimum block size. So that would work.
>> 
> 
> But there is no guarantee that disk_size/block_size < INT_MAX..

I think you mean 2^32-1, but yes there is no guarantee of that. In that case 
you would need to break the call up into multiple calls.

However, being able to break the call up into multiple calls seems pretty 
sensible given that NBD_CMD_WRITE_ZEROES may take a large amount of
time, and a REALLY long time if the server doesn't support trim.

> May be, additional option, specifying the shift would be better. With 
> convention that if offset+length exceeds disk size, length should be 
> recalculated as disk_size-offset.

I don't think we should do that. We already have clear semantics that prevent 
operations beyond the end of the disk. Again, just break the command up into 
multiple commands. No great hardship.
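
(A minimal sketch of that splitting, clamping each request to the export
size and a chosen per-request maximum:)

    def split_requests(disk_size, max_len=1 << 31):
        # Yield (offset, length) pairs covering the export without ever
        # extending past its end.
        offset = 0
        while offset < disk_size:
            yield offset, min(max_len, disk_size - offset)
            offset += max_len

    # e.g. a 2.5 GiB export in 1 GiB requests:
    list(split_requests(5 << 29, max_len=1 << 30))
    # -> [(0, 2**30), (2**30, 2**30), (2**31, 2**29)]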

-- 
Alex Bligh







Re: [Qemu-devel] write_zeroes/trim on the whole disk

2016-09-24 Thread Alex Bligh

> On 24 Sep 2016, at 17:20, Vladimir Sementsov-Ogievskiy 
> <vsement...@virtuozzo.com> wrote:
> 
> Also, according to the documentation, NBD_CMD_TRIM is not appropriate for disk 
> clearing:
> 
>  * `NBD_CMD_TRIM` (4)
> 
>  A hint to the server that the data defined by len and offset is no
>  longer needed. A server MAY discard len bytes starting at offset, but
>  is not required to.
> 
>  After issuing this command, a client MUST NOT make any assumptions
>  about the contents of the export affected by this command, until
>  overwriting it again with `NBD_CMD_WRITE`.
> 
> - it may do nothing.. So, what to do with this? add flag FORCE_TRIM for this 
> command? Or add FORCE_HOLES flag to WRITE_ZEROES?

You cannot force a hole, because the NBD server is not guaranteed to support holes.

Use NBD_CMD_WRITE_ZEROES without NBD_CMD_FLAG_NO_HOLE and you can pretty much 
assume that a server that supports holes will write holes. A server that does 
not support holes will write zeroes. If you don't care whether the resultant 
data is zero, just use NBD_CMD_TRIM. But as you do care (see above) you must be 
prepared for a 'thick' write of zeroes on servers that don't support it.

-- 
Alex Bligh







Re: [Qemu-devel] write_zeroes/trim on the whole disk

2016-09-24 Thread Alex Bligh

> On 24 Sep 2016, at 13:06, Vladimir Sementsov-Ogievskiy 
> <vsement...@virtuozzo.com> wrote:
> 
> Note: if disk size is not aligned to X we will have to send request larger 
> than the disk size to clear the whole disk.

If you look at the block size extension, the size of the disk must be an exact 
multiple of the minimum block size. So that would work.

-- 
Alex Bligh







Re: [Qemu-devel] [Nbd] write_zeroes/trim on the whole disk

2016-09-24 Thread Alex Bligh

> On 23 Sep 2016, at 22:21, Wouter Verhelst <w...@uter.be> wrote:
> 
> On Fri, Sep 23, 2016 at 02:00:06PM -0500, Eric Blake wrote:
>> My preference would be a new flag to the existing commands, with
>> explicit documentation that 0 offset and 0 length must be used with that
>> flag, when requesting a full-device wipe.
> 
> Alternatively, what about a flag that says "if you use this flag, the
> size should be left-shifted by X bits before processing"? That allows
> you to do TRIM or WRITE_ZEROES on much larger chunks, without being
> limited to "whole disk" commands. We should probably make it an illegal
> flag for any command that actually sends data over the wire, though.

I'd prefer an approach like this. Perhaps X could be negotiated
with the block size extension (or simply be defined as the
'preferred block size').

This could then be defined to work with all commands.
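
For illustration only (no such flag exists in the protocol today): with a
shift of 12, i.e. units of a 4096-byte preferred block size, the existing
32-bit length field could describe just under 16TiB per request.

    #include <stdint.h>

    /* Hypothetical 'shifted length' semantics for the proposed flag. */
    static uint64_t effective_length(uint32_t len_field, unsigned shift)
    {
        return (uint64_t)len_field << shift;
    }
    /* effective_length(UINT32_MAX, 12) == (2^32 - 1) * 4096,
     * just under 16TiB in a single request. */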

-- 
Alex Bligh







Re: [Qemu-devel] kvm_arch_put_registers and xsave

2016-06-19 Thread Alex Bligh

On 19 Jun 2016, at 21:54, Peter Maydell <peter.mayd...@linaro.org> wrote:

> The purpose of the get/put functions, broadly, is "copy
> state from the hypervisor into QEMU's cpu state struct
> and vice-versa". The specific details are down to KVM's ABI
> (and to historical details like some KVM ioctls being newer
> or optional). For a from-scratch hypervisor I think all
> you need to do is copy the state from env->regs,
> env->xmm_regs, etc etc into the hypervisor and back,
> however is best suited to the hypervisor's APIs.

Thanks.

Hypervisor.framework provides register by register access:

https://developer.apple.com/library/mac/documentation/Hypervisor/Reference/Hypervisor_Reference/index.html#//apple_ref/doc/uid/TP40016756-CH2-DontLinkElementID_22

so I guess I should do that with every register, and ignore the explicit xsave 
support.
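
A minimal sketch of that per-register copy (hvf_put_gp_registers() is an
invented name; only hv_vcpu_write_register() and the HV_X86_* constants are
from Hypervisor.framework, and error handling is elided):

    #include <Hypervisor/hv.h>

    /* Push general-purpose register state into the vcpu one register
     * at a time. */
    static void hvf_put_gp_registers(hv_vcpuid_t vcpu,
                                     uint64_t rax, uint64_t rbx,
                                     uint64_t rip, uint64_t rflags)
    {
        hv_vcpu_write_register(vcpu, HV_X86_RAX, rax);
        hv_vcpu_write_register(vcpu, HV_X86_RBX, rbx);
        /* ... RCX through R15 likewise ... */
        hv_vcpu_write_register(vcpu, HV_X86_RIP, rip);
        hv_vcpu_write_register(vcpu, HV_X86_RFLAGS, rflags);
    }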

XMM registers and FPState (for instance) appear to be in an opaque format:

https://developer.apple.com/library/mac/documentation/Hypervisor/Reference/Hypervisor_Reference/index.html#//apple_ref/c/func/hv_vcpu_read_fpstate

-- 
Alex Bligh







[Qemu-devel] kvm_arch_put_registers and xsave

2016-06-19 Thread Alex Bligh
I'm still plugging away at my Hypervisor.Framework port (now rewritten
to be another accelerator like kvm).

It appears to be setting up memory regions, and I'm now working on
the equivalent of kvm_cpu_exec. I see an exit code 33 on the first
call, which is an invalid VMCS. This is unsurprising as I haven't
yet implemented the equivalent of kvm_arch_put_registers so the
register contents will presumably be illegal.

Looking at kvm_arch_put_registers it appears to be pretty complicated.
It appears to 'put' each individual register type, as well as putting
an xsave region. I'm a bit confused why the ordinary registers 'put'
are not then overwritten by the xsave put.

Assuming I am only targeting processors supporting XSAVE (which I
believe is reasonable given what Macs support Hypervisor.Framework),
is there a reason I shouldn't merely XRSTOR (by writing the
XSAVE region with the appropriate hv_ call) and ignore all the
other register futzing? Or is it more complicated than that
because (presumably) something sets up register states in the
cpu->regs, cpu->sregs areas? I'm a bit confused as to how all
this works to be honest. Any ideas / docs to point to?

-- 
Alex Bligh







Re: [Qemu-devel] [Nbd] [PATCH v4 04/11] nbd: Improve server handling of bogus commands

2016-06-15 Thread Alex Bligh

On 15 Jun 2016, at 10:18, Paolo Bonzini <pbonz...@redhat.com> wrote:

>> So what should those servers do (like 2 of mine) which don't buffer
>> the entire read, if they get an error having already sent some data?
> 
> They have sent an error code of zero, and it turned out to be wrong.  So
> the only thing they can do safely is disconnect.

Right, but that is not what Wouter's change says:

+If an error occurs, the server SHOULD set the appropriate error code
+in the error field. The server MAY then initiate a hard disconnect.
+If it chooses not to, it MUST NOT send any payload for this request.

I read this as either

a) the server can issue a hard disconnect without sending any reply; or

b) it must send the reply header with no payload

It also seems to permit not setting the error code (it's only a 'SHOULD'),
not disconnecting (it's a MAY), then not sending any payload, which is a
nonsense.

Perhaps this should read "If an error occurs, the server MUST either initiate
a hard disconnect before the entire payload has been sent or
set the appropriate code in the error field and send the response header
without any payload." if we want to go down this route.


-- 
Alex Bligh







Re: [Qemu-devel] [Nbd] [PATCH v4 04/11] nbd: Improve server handling of bogus commands

2016-06-15 Thread Alex Bligh

> On 15 Jun 2016, at 09:03, Wouter Verhelst <w...@uter.be> wrote:
> 
> On Wed, Jun 15, 2016 at 09:05:22AM +0200, Wouter Verhelst wrote:
>> There are more clients than the Linux and qemu ones, but I think it's
>> fair to say that those two are the most important ones. If they agree
>> that a read reply which errors should come without payload, then I think
>> we should update the standard to say that, too.
> 
> I've just pushed a commit that changes the spec (and the implementation)
> so that if a server encounters a read error, it does not send a payload.
> 
> In other words, the current behaviour of qemu is correct, is now
> documented to be correct, and should not be changed.

So what should those servers do (like 2 of mine) which don't buffer
the entire read, if they get an error having already sent some data?

--
Alex Bligh






signature.asc
Description: Message signed with OpenPGP using GPGMail


Re: [Qemu-devel] [PATCH v4 04/11] nbd: Improve server handling of bogus commands

2016-06-14 Thread Alex Bligh

> On 14 Jun 2016, at 16:11, Paolo Bonzini <pbonz...@redhat.com> wrote:
> 
>> To illustrate the problem, consider what qemu itself would do as
>> a server if it can't buffer the entire read issued to it.
> 
> Return ENOMEM?

Well OK, qemu then 'works' on the basis it breaks another
part of the spec, which is coping with long reads.

> However, it looks like the
> de facto status prior to structured replies is that the error is in the
> spec, and this patch introduces a regression.

Well, I guess the patch makes it work the same as the
reference server implementation and the spec, which I'd
consider a fix. My view is that the error is in the
kernel client. I think Erik CC'd in nbd-general
re the comment that the spec was broken; I don't think
it is, and don't propose to change it. Wouter might or
might not feel differently.

It's been reasonably well known (I wrote about it
at least 3 years ago), that the current implementation
(reference + kernel) does not cope well with errors
on reads, so I'm guessing one is just trading one
set of brokenness for another. So I'm pretty relaxed
about what goes in qemu.

-- 
Alex Bligh







Re: [Qemu-devel] [PATCH v4 04/11] nbd: Improve server handling of bogus commands

2016-06-14 Thread Alex Bligh

On 14 Jun 2016, at 14:32, Paolo Bonzini <pbonz...@redhat.com> wrote:

> 
> On 13/06/2016 23:41, Alex Bligh wrote:
>> That's one of the reasons that there is a proposal to add
>> STRUCTURED_READ to the spec (although I still haven't had time to
>> implement that for qemu), so that we have a newer approach that allows
>> for proper error handling without ambiguity on whether bogus bytes must
>> be sent on a failed read.  But you'd have to convince me that ALL
>> existing NBD server and client implementations expect to handle a read
>> error without read payload, otherwise, I will stick with the notion that
>> the current spec wording is correct, and that read errors CANNOT be
>> gracefully recovered from unless BOTH sides transfer (possibly bogus)
>> bytes along with the error message, and which is why BOTH sides of the
>> protocol are warned that read errors usually result in a disconnection
>> rather than clean continuation, without the addition of STRUCTURED_READ.
> 
> I suspect that there are exactly two client implementations,

My understanding is that there are more than 2 client implementations.
A quick google found at least one BSD client. I bet read error handling
is a mess in all of them.

> namely
> Linux and QEMU's, and both do the right thing.

This depends what you mean by 'right'. Both appear to be non-compliant
with the standard.

Note the standard is not defined by the client implementation, but
by the protocol document.

IMHO the 'right thing' is what is in the spec. Servers can't send an
error in any other way if they don't buffer the entire read first, as the
read may error towards the end.

To illustrate the problem, consider what qemu itself would do as
a server if it can't buffer the entire read issued to it.

> What servers do doesn't matter, if all the clients agree.

The spec originally was not clear on how errors on reads should be
handled, leading to any read causing a protocol drop. The spec is
now clear. Unfortunately it is not possible to make a back compatible
fix. Hence the real fix here is to implement structured replies,
which is what Eric and I have been working on.
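
For reference, the chunk framing being drafted for structured replies looks
roughly like this (field widths from the draft extension; names and constants
may still change, and the fields are big-endian on the wire, so a real
implementation would (de)serialise them field by field):

    #include <stdint.h>

    /* One chunk of a structured reply. Errors get their own chunk,
     * so a read that fails part-way can still be reported cleanly
     * after some data chunks have already been sent. */
    struct nbd_structured_reply_chunk {
        uint32_t magic;   /* structured reply magic */
        uint16_t flags;   /* e.g. 'this is the final chunk' */
        uint16_t type;    /* offset+data, hole, or error */
        uint64_t handle;  /* matches the originating request */
        uint32_t length;  /* payload bytes following this header */
    };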

-- 
Alex Bligh







Re: [Qemu-devel] [PATCH v4 04/11] nbd: Improve server handling of bogus commands

2016-06-13 Thread Alex Bligh

On 13 Jun 2016, at 13:25, Eric Blake <ebl...@redhat.com> wrote:

> On 06/13/2016 06:10 AM, Paolo Bonzini wrote:
>> 
>> 
>> On 12/05/2016 00:39, Eric Blake wrote:
>>> - If we report an error to NBD_CMD_READ, we are not writing out
>>> any data payload; but the protocol says that a client can expect
>>> to read the payload no matter what (and must instead ignore it),
>>> which means the client will start reading our next replies as
>>> its data payload. Fix by disconnecting (an alternative fix of
>>> sending bogus payload would be trickier to implement).
>> 
>> This is an error in the spec.  The Linux driver doesn't expect to read
>> the payload here, and neither does block/nbd-client.c.
> 
> That's one of the reasons that there is a proposal to add
> STRUCTURED_READ to the spec (although I still haven't had time to
> implement that for qemu), so that we have a newer approach that allows
> for proper error handling without ambiguity on whether bogus bytes must
> be sent on a failed read.  But you'd have to convince me that ALL
> existing NBD server and client implementations expect to handle a read
> error without read payload, otherwise, I will stick with the notion that
> the current spec wording is correct, and that read errors CANNOT be
> gracefully recovered from unless BOTH sides transfer (possibly bogus)
> bytes along with the error message, and which is why BOTH sides of the
> protocol are warned that read errors usually result in a disconnection
> rather than clean continuation, without the addition of STRUCTURED_READ.

To back up what Eric said:

Unfortunately the design is pretty much broken for reporting errors
on reads (at least in part as there is no way to signal errors that
occur after some of the reply has been written).

The spec specifies that on a read, no matter whether or not there
is an error, the data is all sent. This was after some mailing
list conversations on the subject which indicated this was the
least broken way to do things (IIRC).

This is actually what nbd-server.c does in the threaded handler:
 https://github.com/yoe/nbd/blob/master/nbd-server.c#L1468

For amusement value, the non-threaded handler (which is not used
any more) does not send any payload on an error:
 https://github.com/yoe/nbd/blob/master/nbd-server.c#L1734

In essence read error handling is a horrible mess in NBD, and
I would not expect it to work in general :-(

--
Alex Bligh






signature.asc
Description: Message signed with OpenPGP using GPGMail


Re: [Qemu-devel] OS-X hypervisor.framework support

2016-06-12 Thread Alex Bligh

On 6 Jun 2016, at 13:29, Alex Bligh <a...@alex.org.uk> wrote:

> Is anyone working on support for hypervisor.framework (OS-X's equivalent to 
> kvm as far as I can see)?
> 
> If not, I might have a go in my copious spare time (cough) and if anyone 
> would be interested in helping, or giving advice that would be great.
> 
> I note that:
>  https://github.com/mist64/xhyve
> 
> has support for this, and its license is described as "BSD" (that's it). Is 
> that Qemu GPL compatible? Though I suspect a clean implementation may be 
> easier.

I started having a go at this.

My thesis is that it's easier to emulate the KVM API and translate it
to Hypervisor.Framework calls than to modify qemu so that it
calls Hypervisor.Framework directly everywhere, because there are
so many calls to the KVM API. What I have therefore done is
written some code to intercept all the kvm ioctls. As all but
a couple of these were ALREADY sent through kvm_*_ioctl() (for
tracing I presume) this was pretty easy to do. If this works, then
any speed critical bits can be rewritten natively. This avoids
splattering HVF code everywhere across qemu, which would be
a complete pain to maintain. Essentially, one can carry on
writing for kvm, and the idea is that in general it will work
for HVF too.
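
In outline, the interception might look something like this sketch (all
names here, including CONFIG_HVF_SHIM, are hypothetical, not existing qemu
functions):

    /* Placeholder declarations for the two paths. */
    int hvf_emulate_vcpu_ioctl(void *cpu, unsigned long request, void *arg);
    int real_kvm_vcpu_ioctl(void *cpu, unsigned long request, void *arg);

    /* Requests that used to go straight to /dev/kvm are routed to an
     * in-process emulation when built with the hvf shim. */
    static int kvm_vcpu_ioctl_shim(void *cpu, unsigned long request,
                                   void *arg)
    {
    #ifdef CONFIG_HVF_SHIM
        return hvf_emulate_vcpu_ioctl(cpu, request, arg);
    #else
        return real_kvm_vcpu_ioctl(cpu, request, arg);
    #endif
    }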

You thus need to compile with --enable-kvm --enable-hvf (for
now). Yes, that is a bit confusing.

I've emulated enough ioctls (who knows whether the code actually
works) to get to the point where kvm_run is called (and translated
into an hv_vcpu_run) without an ioctl being errored. Of course it
doesn't actually work at this stage (my VM exit processing is, um,
non-existent - the idea was to translate this into the kvm_run
structure, probably), and it's complaining about a corrupt
VMCS at the moment.

Whilst I haven't (yet) demonstrated that this works, I think I have
demonstrated that if I can get it to work, it's a viable approach.
Modifications outside hvf specific files are pretty minimal
(tens of lines).

A significant problem is that Hypervisor.Framework has about zero
useful documentation on how to use it. I suspect one is best
reading the Intel 64-ia-32 architectures software developer
system programming manual, and guessing how Apple might have
implemented it based on that. I got to the current point without
looking seriously at xhyve, though I think to cope with vmexits
I will have to compare and contrast that to qemu.

I'd be interested in comments as to whether this is useful / mad /
wrong approach / whatever.

Code at:
  https://github.com/abligh/qemu/tree/qemu-hypervisor-framework
(based on 2.6.0)

-- 
Alex Bligh







Re: [Qemu-devel] OS-X hypervisor.framework support

2016-06-06 Thread Alex Bligh
Peter,

On 6 Jun 2016, at 13:43, Peter Maydell <peter.mayd...@linaro.org> wrote:

> Not that I am aware of.
> 
> Previous discussion:
> https://lists.gnu.org/archive/html/qemu-devel/2015-11/msg03086.html

Thanks. Missed that due to "Framwork" spelling!

-- 
Alex Bligh







[Qemu-devel] OS-X hypervisor.framework support

2016-06-06 Thread Alex Bligh
Is anyone working on support for hypervisor.framework (OS-X's equivalent to kvm 
as far as I can see)?

If not, I might have a go in my copious spare time (cough) and if anyone would 
be interested in helping, or giving advice that would be great.

I note that:
  https://github.com/mist64/xhyve

has support for this, and its license is described as "BSD" (that's it). Is 
that Qemu GPL compatible? Though I suspect a clean implementation may be easier.

-- 
Alex Bligh







Re: [Qemu-devel] [Nbd] [PULL 23/28] nbd: always query export list in fixed new style protocol

2016-05-17 Thread Alex Bligh

On 17 May 2016, at 17:56, Eric Blake <ebl...@redhat.com> wrote:

> That's why I wonder if we need to document a minimum cutoff at which
> clients should assume will always be serviced, and which servers should
> not treat as an attack, and whether it should be larger than 32.

I don't think we need a specific number, but I think a one sentence
statement that servers are in general permitted to disconnect clients
which are behaving in a denial of service attack type manner would be
useful.

Given nbdkit only ever has one export, it seems eminently reasonable
for it to use a lower number.

--
Alex Bligh






signature.asc
Description: Message signed with OpenPGP using GPGMail


Re: [Qemu-devel] [Nbd] [PULL 23/28] nbd: always query export list in fixed new style protocol

2016-05-17 Thread Alex Bligh

On 17 May 2016, at 17:50, Eric Blake <ebl...@redhat.com> wrote:
> 
> Option 1: I think what I would prefer is a doc patch to the NBD protocol
> that explicitly states that returning 0 names followed by NBD_REP_ACK
> (current nbdkit behavior) implies that the server doesn't honor names at
> all, and will let ANY name work for NBD_OPT_EXPORT_NAME/NBD_OPT_GO; then
> patch qemu client to treat 0 names the same as NBD_REP_UNSUP (the server
> doesn't honor names, so assume any name will work).
...
> But to date, I think ALL of the options (except NBD_OPT_EXPORT_NAME)
> _are_ optional.  In fact, I'm arguing that per Option 2, we WANT
> NBD_OPT_LIST to be optional for servers that don't care about names.

Option 3: (which I would prefer) fix the wording indicating what
options are mandatory (I think they all should be but that's another
discussion really), and in the meantime simply have nbdkit return
the empty string as an export name in NBD_OPT_LIST as you
originally suggested, which is what qemu will connect to by default,
and which is what other clients connect to by default.

I don't want to be an arse about this, but I think the standard is
meant to be normative, IE influence behaviour, rather than just
document existing behaviour. In this instance I think nbdkit
is actually wrong (returning zero entries cannot be right as
that explicitly says nothing is exported, whereas an error is
forgiveable perhaps), and the author is OK to fix it, so I think
we should fix the software to conform to the standard, not the
standard to conform to the software.

--
Alex Bligh






signature.asc
Description: Message signed with OpenPGP using GPGMail


Re: [Qemu-devel] [Nbd] [PULL 23/28] nbd: always query export list in fixed new style protocol

2016-05-17 Thread Alex Bligh

On 17 May 2016, at 16:54, Richard W.M. Jones <rjo...@redhat.com> wrote:

> On Tue, May 17, 2016 at 04:22:06PM +0100, Alex Bligh wrote:
>> nbdkit is non-compliant in that case. Support of NBD_OPT_LIST is
>> compulsory, even if you support it by returning a nameless export
>> (or default). Moreover support of export names is compulsory
>> (even if you have a single fixed one called "" or "default").
> 
> The protocol doc doesn't mention "default" (assuming you mean that
> literal string).  It says:

As per my message to Eric, I meant: use an arbitrary piece of
text in your reply to NBD_OPT_LIST (it doesn't matter what it is,
'default' is not special). It doesn't matter because you ignore
what is passed in NBD_OPT_EXPORT_NAME. I was just suggesting making
things more readable for clients that did a list.

>> Whilst that is fair enough, NBD_OPT_LIST is a mandatory part of the
>> specification. If a server doesn't implement mandatory parts of
>> the specification, then bad things will happen.
> 
> It implements it, it's just that there wasn't a way to implement
> anything other than returning an error since we accept anything as an
> export name.

In which case you should just return an export name. Any
export name will work. The fact you don't have names for your
exports doesn't mean (in my book) you can error the list
request. You have one export. You don't know what its name is,
but you should list it.

-- 
Alex Bligh







Re: [Qemu-devel] [Nbd] [PULL 23/28] nbd: always query export list in fixed new style protocol

2016-05-17 Thread Alex Bligh

On 17 May 2016, at 16:52, Eric Blake <ebl...@redhat.com> wrote:

> On 05/17/2016 09:22 AM, Alex Bligh wrote:
> 
>>>> nbdkit's implemention of NBD_OPT_LIST returns an error, because there
>>>> is no such thing as a list of export names supported (in effect nbdkit
>>>> allows any export name).
>> 
>> nbdkit is non-compliant in that case. Support of NBD_OPT_LIST is
>> compulsory, even if you support it by returning a nameless export
>> (or default). Moreover support of export names is compulsory
>> (even if you have a single fixed one called "" or "default").
> 
> Where does the protocol state that? I don't see any mandatory NBD_OPT in
> the protocol, except for NBD_OPT_EXPORT_NAME (as I read it, a server may
> reply with NBD_REP_ERR_UNSUP for every other option).  I'm fine if we
> WANT to make NBD_OPT_LIST mandatory (and either document under
> NBD_OPT_LIST that NBD_REP_ERR_UNSUP cannot be used, or document under
> NBD_REP_ERR_UNSUP that it cannot be used with NBD_OPT_LIST), but that
> would make the current nbdkit non-conforming; so it might be nicer to
> make a change to the protocol document that instead permits current
> nbdkit behavior and puts the burden on clients to interoperate when
> NBD_OPT_LIST is not supported.

Under "Option Types" it says:

NBD_OPT_LIST (3) "Return zero or more NBD_REP_SERVER replies, one
for each export, followed by NBD_REP_ACK or an error (such as
NBD_REP_ERR_SHUTDOWN). The server MAY omit entries from this list if
TLS has not been negotiated, the server is operating in SELECTIVETLS
mode, and the entry concerned is a TLS-only export."

The language is imperative (if admittedly not RFC2119).

Also

The server will reply to any option apart from NBD_OPT_EXPORT_NAME
with reply packets in the following format:

again not in RFC2119 words, but says 'will' not 'MAY'.

I think the burden here is on the server to implement the protocol
as described.

I now accept that the wording should be cleared up :-)

>> This is assuming nbdkit supports 'fixed newstyle'; if nbdkit merely
>> supports 'newstyle' negotiation, then we know qemu won't connect
>> to it as modern qemu only supports servers that support 'fixed newstyle'
>> I believe.
> 
> Not quite true. Qemu as a client is perfectly fine connecting with a
> plain 'newstyle' server (and not just 'fixed newstyle'); but will limit
> itself to JUST NBD_OPT_EXPORT_NAME in that case.  So nbdkit must be
> exposing 'fixed newstyle' for qemu to be attempting NBD_OPT_LIST.

OK. I think the problem is that if nbdkit supports 'fixed newstyle'
it should really support all of the options. Alternatively it can
support just 'newstyle' and then nothing will send anything bar
NBD_OPT_EXPORT_NAME.

>>> Perhaps nbdkit should implement NBD_OPT_LIST returning just "" (the
>>> default name) as its only list entry?
>> 
>> Or "default".
> 
> As I read the protocol, I don't see "default" as a permissible name of
> the default export, just "".

Sorry, I meant that if Richard implements NBD_OPT_LIST he could either
return an export name of "" or "default". Or for that matter "nbdkit"
or "hello world". Just "default" might display better than "". As he's
ignoring whatever comes back in NBD_OPT_EXPORT_NAME it doesn't really
matter what he returns in NBD_OPT_LIST.

> Also, we currently state that NBD_OPT_LIST has zero or more
> NBD_REP_SERVER replies, which means that it is valid for the command to
> succeed with NBD_REP_ACK _without_ advertising any exports at all

Yes. This is a valid way of saying "I have no exports that work".

> (rather annoying in that it tells you nothing about whether
> NBD_OPT_EXPORT_NAME/NBD_OPT_GO will succeed anyways).

If there are no exports then NBD_OPT_GO can't be expected to work.

>  Should we reword
> that to require that if NBD_REP_ACK is sent, then at least one
> NBD_REP_SERVER reply was given (where "" is the obvious name, but at the
> same time where "" is not mandatory if some other name is given)?

I don't think so. A server might legitimately serve an empty list to
one IP and a non-empty list to another IP depending on configuration.

>>> And at some point, nbdkit should probably implement NBD_OPT_GO (which is
>>> a nicer replacement to NBD_OPT_EXPORT_NAME; I've got patches pending for
>>> qemu to implement that).
>>> 
>>> In fact, if NBD_OPT_GO is supported, then my patches to qemu won't even
>>> try NBD_OPT_LIST.
>> 
>> Sure, but NBD_OPT_INFO/GO is still an experimental protocol extension.
> 
> Hopefully not much longer; we have multiple imp

Re: [Qemu-devel] [Nbd] [PULL 23/28] nbd: always query export list in fixed new style protocol

2016-05-17 Thread Alex Bligh
Eric, Richard,

On 17 May 2016, at 16:09, Eric Blake <ebl...@redhat.com> wrote:

> [adding nbd-list]
> 
> On 05/17/2016 03:53 AM, Richard W.M. Jones wrote:
>> On Tue, Feb 16, 2016 at 05:34:41PM +0100, Paolo Bonzini wrote:
>>> From: "Daniel P. Berrange" <berra...@redhat.com>
>>> 
>>> With the new style protocol, the NBD client will currently
>>> send NBD_OPT_EXPORT_NAME as the first (and indeed only)
>>> option it wants. The problem is that the NBD protocol spec
>>> does not allow for returning an error message with the
>>> NBD_OPT_EXPORT_NAME option. So if the server mandates use
>>> of TLS, the client will simply see an immediate connection
>>> close after issuing NBD_OPT_EXPORT_NAME which is not user
>>> friendly.
>>> 
> 
>> I just bisected qemu 2.6 to find out where it breaks interop with
>> nbdkit, and it is this commit.
>> 
>> nbdkit implements the newstyle protocol, but doesn't implement export
>> names (yet, maybe in future).  It processes the NBD_OPT_EXPORT_NAME
>> option, but ignores the export name and carries on regardless.
>> 
>> nbdkit's implemention of NBD_OPT_LIST returns an error, because there
>> is no such thing as a list of export names supported (in effect nbdkit
>> allows any export name).

nbdkit is non-compliant in that case. Support of NBD_OPT_LIST is
compulsory, even if you support it by returning a nameless export
(or default). Moreover support of export names is compulsory
(even if you have a single fixed one called "" or "default").

This is assuming nbdkit supports 'fixed newstyle'; if nbdkit merely
supports 'newstyle' negotiation, then we know qemu won't connect
to it as modern qemu only supports servers that support 'fixed newstyle'
I believe.

> Perhaps nbdkit should implement NBD_OPT_LIST returning just "" (the
> default name) as its only list entry?

Or "default".

> And at some point, nbdkit should probably implement NBD_OPT_GO (which is
> a nicer replacement to NBD_OPT_EXPORT_NAME; I've got patches pending for
> qemu to implement that).
> 
> In fact, if NBD_OPT_GO is supported, then my patches to qemu won't even
> try NBD_OPT_LIST.

Sure, but NBD_OPT_INFO/GO is still an experimental protocol extension.

>> Therefore I don't believe the assumption made here -- that you can
>> list export names and choose them on the client side -- is a valid
>> one.
> 
> Qemu is not choosing an export name, so much as proving whether any
> OTHER errors can be detected (such as export name not present, or TLS
> required).  It _should_ be gracefully ignoring NBD_OPT_LIST errors if
> such errors don't definitively prove that the export will not succeed,
> but I guess this is not happening.  I'll see if I can tweak the qemu
> logic to be more forgiving of nbdkit's current behavior.

Whilst that is fair enough, NBD_OPT_LIST is a mandatory part of the
specification. If a server doesn't implement mandatory parts of
the specification, then bad things will happen.

My interpretation of NBD_OPT_LIST failing would be 'this server
doesn't have anything it can export'.

>> Naturally the protocol document
>> (https://github.com/yoe/nbd/blob/master/doc/proto.md) isn't clear on
>> this case.
> 
> You're right that we may also want to tweak the NBD protocol to make
> this interoperability point obvious.

I can't actually see the issue here. It explains what needs to be
implemented by the server, and that includes NBD_OPT_LIST. Very
happy to add some clarity, but I'm not sure where it's needed.

--
Alex Bligh






signature.asc
Description: Message signed with OpenPGP using GPGMail


Re: [Qemu-devel] [Nbd] [PATCH] nbd: fix trim/discard commands with a length bigger than NBD_MAX_BUFFER_SIZE

2016-05-12 Thread Alex Bligh

On 11 May 2016, at 22:06, Wouter Verhelst <w...@uter.be> wrote:

> On Tue, May 10, 2016 at 04:08:50PM +0100, Alex Bligh wrote:
>> What surprises me is that a release kernel is using experimental
>> NBD extensions; there's no guarantee these won't change. Or does
>> fstrim work some other way?
> 
> What makes you say NBD_CMD_TRIM is experimental? It's been implemented for
> years.

I was confusing it with NBD_CMD_WRITE_ZEROES, which I corrected later
in the thread.

-- 
Alex Bligh







Re: [Qemu-devel] [PATCH v4 00/11] nbd: tighter protocol compliance

2016-05-12 Thread Alex Bligh

On 11 May 2016, at 23:39, Eric Blake <ebl...@redhat.com> wrote:

> Fix several corner-case bugs in our implementation of the NBD
> protocol, both as client and as server.

I thought I'd added a Reviewed-By: line to more of these before.
On a very very quick look, they all look good to me.

-- 
Alex Bligh







Re: [Qemu-devel] [Nbd] [PATCH] nbd: fix trim/discard commands with a length bigger than NBD_MAX_BUFFER_SIZE

2016-05-12 Thread Alex Bligh

On 11 May 2016, at 22:12, Wouter Verhelst <w...@uter.be> wrote:

> On Tue, May 10, 2016 at 04:38:29PM +0100, Alex Bligh wrote:
>> On 10 May 2016, at 16:29, Eric Blake <ebl...@redhat.com> wrote:
>>> So the kernel is currently one of the clients that does NOT honor block
>>> sizes, and as such, servers should be prepared for ANY size up to
>>> UINT_MAX (other than DoS handling).
>> 
>> Or not to permit a connection.
> 
> Right -- and this is why I was recommending against making this a MUST in the
> first place.

Indeed, and it currently is a 'MAY':

> except that if a server believes a client's behaviour constitutes
> a denial of service, it MAY initiate a hard disconnect.

-- 
Alex Bligh







Re: [Qemu-devel] [Nbd] [PATCH] nbd: fix trim/discard commands with a length bigger than NBD_MAX_BUFFER_SIZE

2016-05-11 Thread Alex Bligh

On 11 May 2016, at 15:08, Eric Blake <ebl...@redhat.com> wrote:

> Then I think I will propose a doc patch to the extension-info branch
> that adds new INFO items for NBD_INFO_TRIM_SIZE and
> NBD_INFO_WRITE_ZERO_SIZE (if requested by client and replied by server,
> then this can be larger than the normal maximum size in
> NBD_INFO_BLOCK_SIZE; if either side doesn't request the info, then
> assume any maximum in NBD_INFO_BLOCK_SIZE applies, otherwise UINT32_MAX;
> and the two infos are separate items because NBD_FLAG_SEND_TRIM and
> NBD_FLAG_SEND_WRITE_ZEROES are orthogonally optional).

 ...

at this rate you'd be better off with a list of command / maximum size
tuples!

--
Alex Bligh






signature.asc
Description: Message signed with OpenPGP using GPGMail


Re: [Qemu-devel] [Nbd] [PATCH] nbd: fix trim/discard commands with a length bigger than NBD_MAX_BUFFER_SIZE

2016-05-10 Thread Alex Bligh

On 10 May 2016, at 17:04, Quentin Casasnovas <quentin.casasno...@oracle.com> 
wrote:

> Alright, I assumed by 'length of an NBD request', the specification was
> talking about the length of.. well, the request as opposed to whatever is
> in the length field of the request header.

With NBD_CMD_TRIM the length in the header field is 32 bit and specifies
the length of data to trim, not the length of the data transferred (which
is none).

> Is there a use case where you'd want to split up a single big TRIM request
> in smaller ones (as in some hardware would not support it or something)?
> Even then, it looks like this splitting up would be hardware dependant and
> better implemented in block device drivers.

Part of the point of the block size extension is to push such limits to the
client.

I could make up use cases (e.g. that performing a multi-gigabyte trim in
a single threaded server will effectively block all other I/O), but the
main reason in my book is orthogonality, and the fact the client needs
to do some breaking up anyway.

> I'm just finding odd that something that fits inside the length field can't
> be used.

That's a different point. That's Qemu's 'Denial of Service Attack'
prevention, *not* maximum block sizes. It isn't dropping it because
of a maximum block size parameter. If it doesn't support the block size
extension (which the version you're looking at does not), it's meant
to handle requests up to 2^32-1 bytes long, EXCEPT that it MAY error requests
so large as to cause a denial of service attack. As this doesn't fit
into that case (it's a TRIM), it shouldn't be erroring it on those grounds.

I agree Qemu should fix that.

(So in a sense Eric and I are arguing about something irrelevant to
your current problem, which is how this would work /with/ the block
size extensions, as Eric is in the process of implementing them).

>  I do agree with your point number 3, obviously if the length
> field type doesn't allow something bigger than a u32, then the kernel has
> to do some breaking up in that case.


-- 
Alex Bligh







Re: [Qemu-devel] [Nbd] [PATCH] nbd: fix trim/discard commands with a length bigger than NBD_MAX_BUFFER_SIZE

2016-05-10 Thread Alex Bligh

On 10 May 2016, at 16:46, Eric Blake <ebl...@redhat.com> wrote:

> Does anyone have an easy way to cause the kernel to request a trim
> operation that large on a > 4G export?  I'm not familiar enough with
> EXT4 operation to know what file system operations you can run to
> ultimately indirectly create a file system trim operation that large.
> But maybe there is something simpler - does the kernel let you use the
> fallocate(2) syscall operation with FALLOC_FL_PUNCH_HOLE or
> FALLOC_FL_ZERO_RANGE on an fd backed by an NBD device?

Not tried it, but fallocate(1) with -p ?

http://man7.org/linux/man-pages/man1/fallocate.1.html

As it takes length and offset in TB, PB, EB, ZB and YB, it
seems to be 64 bit aware :-)
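
Something like the following would exercise the same path from C (a sketch
only: /dev/nbd0 is an example device, and whether the kernel accepts
fallocate() on a raw block device at all is version-dependent):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/nbd0", O_RDWR);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        /* Punch a 5GiB hole; if the kernel turns this into discards,
         * it should surface as one or more NBD_CMD_TRIM requests. */
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      0, 5ULL * 1024 * 1024 * 1024) < 0) {
            perror("fallocate");
        }
        close(fd);
        return 0;
    }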

--
Alex Bligh






signature.asc
Description: Message signed with OpenPGP using GPGMail


Re: [Qemu-devel] [Nbd] [PATCH] nbd: fix trim/discard commands with a length bigger than NBD_MAX_BUFFER_SIZE

2016-05-10 Thread Alex Bligh

On 10 May 2016, at 16:45, Quentin Casasnovas <quentin.casasno...@oracle.com> 
wrote:

> I'm by no mean an expert in this, but why would the kernel break up those
> TRIM commands?  After all, breaking things up makes sense when the length
> of the request is big, not that much when it only contains the request
> header, which is the case for TRIM commands.

1. You are assuming that the only reason for limiting the size of operations is
   limiting the data transferred within one request. That is not necessarily
   the case. There are good reasons (if only orthogonality) to have limits
   in place even where no data is transferred.

2. As and when the blocksize extension is implemented in the kernel (it isn't
   now), the protocol requires it.

3. The maximum length of an NBD trim operation is 2^32. The maximum length
   of a trim operation at the block layer is larger. Therefore the kernel
   needs to do at least some breaking up.

-- 
Alex Bligh







Re: [Qemu-devel] [Nbd] [PATCH] nbd: fix trim/discard commands with a length bigger than NBD_MAX_BUFFER_SIZE

2016-05-10 Thread Alex Bligh

On 10 May 2016, at 16:29, Eric Blake <ebl...@redhat.com> wrote:

> So the kernel is currently one of the clients that does NOT honor block
> sizes, and as such, servers should be prepared for ANY size up to
> UINT_MAX (other than DoS handling).

Interesting followup question:

If the kernel does not fragment TRIM requests at all (in the
same way it fragments read and write requests), I suspect
something bad may happen with TRIM requests over 2^31
in size (particularly over 2^32 in size), as the length
field in nbd only has 32 bits.

Whether it supports block size constraints or not, it is
going to need to do *some* breaking up of requests.

--
Alex Bligh






signature.asc
Description: Message signed with OpenPGP using GPGMail


Re: [Qemu-devel] [Nbd] [PATCH] nbd: fix trim/discard commands with a length bigger than NBD_MAX_BUFFER_SIZE

2016-05-10 Thread Alex Bligh
Eric,

On 10 May 2016, at 16:29, Eric Blake <ebl...@redhat.com> wrote:
>>> Maybe we should revisit that in the spec, and/or advertise yet another
>>> block size (since the maximum size for a trim and/or write_zeroes
>>> request may indeed be different than the maximum size for a read/write).
>> 
>> I think it's up to the server to either handle large requests, or
>> for the client to break these up.
> 
> But the question at hand here is whether we should permit servers to
> advertise multiple maximum block sizes (one for read/write, another one
> for trim/write_zero, or even two [at least qemu tracks a separate
> maximum trim vs. write_zero sizing in its generic block layer]), or
> merely stick with the current wording that requires clients that honor
> maximum block size to obey the same maximum for ALL commands, regardless
> of amount of data sent over the wire.

In my view, we should not change this. Block sizes maxima are not there
to support DoS prevention (that's a separate phrase). They are there
to impose maximum block sizes. Adding a different maximum block size
for different commands is way too overengineered. There are after
all reasons (especially without structured replies) why you'd want
different maximum block sizes for writes and reads. If clients support
block sizes, they will necessarily have to have the infrastructure
to break requests up.

IE maximum block size should continue to mean maximum block size.

>> 
>> The core problem here is that the kernel (and, ahem, most servers) are
>> ignorant of the block size extension, and need to guess how to break
>> things up. In my view the client (kernel in this case) should
>> be breaking the trim requests up into whatever size it uses as the
>> maximum size write requests. But then it would have to know about block
>> sizes which are in (another) experimental extension.
> 
> Correct - no one has yet patched the kernel to honor block sizes
> advertised through what is currently an experimental extension.

Unsurprising at it's still experimental, and only settled down a couple
of weeks ago :-)

>  (We
> have ioctl(NBD_SET_BLKSIZE) which can be argued to set the kernel's
> minimum block size,

Technically that is 'out of band transmission of block size
constraints' :-)

> but I haven't audited whether the kernel actually
> guarantees that all client requests are sent aligned to the value passed
> that way - but we have nothing to set the maximum size,

indeed

> and are at the
> mercy of however the kernel currently decides to split large requests).

I am surprised TRIM doesn't get broken up the same way READ and WRITE
do.

> So the kernel is currently one of the clients that does NOT honor block
> sizes, and as such, servers should be prepared for ANY size up to
> UINT_MAX (other than DoS handling).

Or not to permit a connection.

>  My question above only applies to
> clients that use the experimental block size extensions.

Indeed.

>> What surprises me is that a release kernel is using experimental
>> NBD extensions; there's no guarantee these won't change. Or does
>> fstrim work some other way?
> 
> No extension in play.  The kernel is obeying NBD_FLAG_SEND_TRIM, which
> is in the normative standard, and unrelated to the INFO/BLOCK_SIZE
> extensions.

My mistake. I was confusing 'WRITE_ZEROES' with 'TRIM'.

--
Alex Bligh






signature.asc
Description: Message signed with OpenPGP using GPGMail


Re: [Qemu-devel] [Nbd] [PATCH] nbd: fix trim/discard commands with a length bigger than NBD_MAX_BUFFER_SIZE

2016-05-10 Thread Alex Bligh

--
Alex Bligh






signature.asc
Description: Message signed with OpenPGP using GPGMail


Re: [Qemu-devel] [Nbd] question on ioctl NBD_SET_FLAGS

2016-05-04 Thread Alex Bligh

On 20 Apr 2016, at 17:18, Wouter Verhelst <w...@uter.be> wrote:

> Hi Eric,
> 
> On Wed, Apr 20, 2016 at 09:42:02AM -0600, Eric Blake wrote:
> [...]
>> But in 3.9, the overlap bug was still present, and the set of global
>> flags had grown to include NBD_FLAG_NO_ZEROS (commit 038e066), which
>> overlaps with NBD_FLAG_READ_ONLY.  Ouch.  This means that a client
>> talking to a server that advertises NO_ZEROES means that the client will
>> mistakenly tell the kernel to treat EVERY export as read-only, even if
>> the client doesn't respond with NBD_FLAG_C_NO_ZEROES.
>> 
>> 3.10 fixed things; negotiate() now uses uint16_t *flags (instead of u32
>> *), and no longer tries to merge global flags with transmission flags;
>> only the transmission flags are ever passed to the kernel via
>> NBD_SET_FLAGS.  Maybe it's good that there was only 2 weeks between 3.9
>> and 3.10, so hopefully few distros are trying to ship that broken version.
> 
> Well, yeah, since 3.10 was an "oops" release when 3.9 exposed that bug
> (which indeed had existed for a while) and which was reported quite
> quickly on the list. Released versions of nbd which have the bug exist
> though, and trying to have a 3.8 (or below) client talk to a 3.9 (or
> above) server has the same issue.
> 
> I decided that there was no way in which I could fix it, and that "the
> export is readonly" is bad but not a "critical data loss" kind of bug,
> so releasing 3.10 was pretty much the only sane thing I could do (other
> than delaying NO_ZEROES, which might have worked too).

For what it's worth, Ubuntu 14.04 (still under long term
support) ships with nbd-client 3.7 and I've just had a bug
report filed against gonbdserver where all exports go readonly.
   https://github.com/abligh/gonbdserver/issues/4

I've added an option to disable NBD_FLAG_NO_ZEROS on nbd-server
but that's pretty horrible. And I think this will affect all
pre-3.10 clients talking to 3.9 and later servers.

I'm going to find a minimal patch to nbd-client and offer that
to Ubuntu / Debian.

This message is here in part so I have something to point them at
on the mailing list :-)

-- 
Alex Bligh







Re: [Qemu-devel] [PATCH v3 36/44] nbd: Improve handling of shutdown requests

2016-04-25 Thread Alex Bligh

On 25 Apr 2016, at 20:20, Eric Blake <ebl...@redhat.com> wrote:

>>>} else if (fixedNewstyle) {
>> 
>> So the above is for NewStyle (not fixedNewStyle)?
> 
> The above is for fixedNewStyle when TLS is not negotiated yet; the 'else
> if' is for fixedNewStyle after TLS has been negotiated.  Prior to TLS,
> we have to special-case NBD_OPT_ABORT and actually quit.

OK. fixedNewStyle is defined as a prerequisite for TLS. I'm hoping
nothing in Qemuland ever did non-fixed NewStyle, and assuming that's
the case I would not even support it (certainly it won't play
nicely with all the other stuff you've been doing).

--
Alex Bligh






signature.asc
Description: Message signed with OpenPGP using GPGMail


Re: [Qemu-devel] [PATCH v3 44/44] nbd: Implement NBD_OPT_BLOCK_SIZE on client

2016-04-25 Thread Alex Bligh
  info->size, info->flags);
> break;
> 
> +case NBD_INFO_BLOCK_SIZE:
> +if (len != sizeof(info->min_block) * 3) {
> +error_setg(errp, "remaining export info len %" PRIu32
> +   " is unexpected size", len);
> +return -1;
> +}
> +if (read_sync(ioc, &info->min_block, sizeof(info->min_block)) !=
> +sizeof(info->min_block)) {
> +error_setg(errp, "failed to read info minimum block size");
> +return -1;
> +}
> +be32_to_cpus(&info->min_block);
> +if (!is_power_of_2(info->min_block)) {
> +error_setg(errp, "server minimum block size %" PRId32
> +   "is not a power of two", info->min_block);
> +return -1;
> +}
> +if (read_sync(ioc, &info->opt_block, sizeof(info->opt_block)) !=
> +sizeof(info->opt_block)) {
> +error_setg(errp, "failed to read info preferred block size");
> +return -1;
> +}
> +be32_to_cpus(&info->opt_block);
> +if (!is_power_of_2(info->opt_block) ||
> +info->opt_block < info->min_block) {
> +error_setg(errp, "server preferred block size %" PRId32
> +   "is not valid", info->opt_block);
> +return -1;
> +}
> +if (read_sync(ioc, &info->max_block, sizeof(info->max_block)) !=
> +sizeof(info->max_block)) {
> +error_setg(errp, "failed to read info maximum block size");
> +return -1;
> +}
> +be32_to_cpus(&info->max_block);
> +TRACE("Block sizes are 0x%" PRIx32 ", 0x%" PRIx32 ", 0x%" PRIx32,
> +  info->min_block, info->opt_block, info->max_block);
> +break;
> +

You should probably check min_block <= opt_block <= max_block here

Also should there be a check that BDRV_SECTOR_SIZE >= min_block?
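
i.e. something like this sketch (a standalone rendering of the suggested
checks, with sector_size standing in for qemu's BDRV_SECTOR_SIZE):

    #include <stdbool.h>
    #include <stdint.h>

    /* The three advertised sizes must be ordered, and the client's
     * fixed sector size must not be below the server minimum. */
    static bool block_sizes_valid(uint32_t min_block, uint32_t opt_block,
                                  uint32_t max_block, uint32_t sector_size)
    {
        return min_block <= opt_block &&
               opt_block <= max_block &&
               min_block <= sector_size;
    }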


> default:
> TRACE("ignoring unknown export info %" PRIu16, type);
> if (drop_sync(ioc, len) != len) {
> @@ -710,8 +769,9 @@ fail:
> #ifdef __linux__
> int nbd_init(int fd, QIOChannelSocket *sioc, NbdExportInfo *info)
> {
> -unsigned long sectors = info->size / BDRV_SECTOR_SIZE;
> -if (info->size / BDRV_SECTOR_SIZE != sectors) {
> +unsigned long sector_size = MAX(BDRV_SECTOR_SIZE, info->min_block);
> +unsigned long sectors = info->size / sector_size;
> +if (info->size / sector_size != sectors) {
> LOG("Export size %" PRId64 " too large for 32-bit kernel", 
> info->size);
>     return -E2BIG;
> }
> @@ -724,18 +784,18 @@ int nbd_init(int fd, QIOChannelSocket *sioc, 
> NbdExportInfo *info)
> return -serrno;
> }
> 
> -TRACE("Setting block size to %lu", (unsigned long)BDRV_SECTOR_SIZE);
> +TRACE("Setting block size to %lu", sector_size);
> 
> -if (ioctl(fd, NBD_SET_BLKSIZE, (unsigned long)BDRV_SECTOR_SIZE) < 0) {
> +if (ioctl(fd, NBD_SET_BLKSIZE, sector_size) < 0) {
> int serrno = errno;
> LOG("Failed setting NBD block size");
> return -serrno;
> }
> 
> TRACE("Setting size to %lu block(s)", sectors);
> -if (size % BDRV_SECTOR_SIZE) {
> -TRACE("Ignoring trailing %d bytes of export",
> -  (int) (size % BDRV_SECTOR_SIZE));
> +if (info->size % sector_size) {
> +TRACE("Ignoring trailing %" PRId64 " bytes of export",
> +  info->size % sector_size);
> }
> 
> if (ioctl(fd, NBD_SET_SIZE_BLOCKS, sectors) < 0) {
> -- 
> 2.5.5
> 
> 

-- 
Alex Bligh







Re: [Qemu-devel] [PATCH v3 43/44] nbd: Implement NBD_OPT_BLOCK_SIZE on server

2016-04-25 Thread Alex Bligh
Eric,

See my message on nbd-general today re the necessity (or not)
of getting NBD_OPT_BLOCK_SIZE first; it may be just that you
can assume 512 is OK.

Otherwise

Reviewed-by: Alex Bligh <a...@alex.org.uk>

Alex

On 23 Apr 2016, at 00:40, Eric Blake <ebl...@redhat.com> wrote:

> The upstream NBD Protocol has defined a new extension to allow
> the server to advertise block sizes to the client, as well as
> a way for the client to inform the server that it intends to
> obey block sizes.
> 
> Thanks to a recent fix, our minimum transfer size is always
> 1 (the block layer takes care of read-modify-write on our
> behalf), although if we wanted down the road, we could
> advertise a minimum of 512 based on our usage patterns to a
> client that is willing to honor block sizes.  Meanwhile,
> advertising our absolute maximum transfer size of 32M will
> help newer clients avoid EINVAL failures.
> 
> We do not reject clients for using the older NBD_OPT_EXPORT_NAME;
> we are no worse off for those clients than we used to be. But
> for clients new enough to use NBD_OPT_GO, we require the client
> to first send us NBD_OPT_BLOCK_SIZE to prove they know about
> sizing restraints, by failing with NBD_REP_ERR_BLOCK_SIZE_REQD.
> All existing released qemu clients (whether old-style or new, at
> least by the end of this series) already honor our limits, and
> will still connect; so at most, this would reject a new non-qemu
> client that doesn't intend to obey limits (and that client could
> still use NBD_OPT_EXPORT_NAME to bypass our rejection).
> 
> Signed-off-by: Eric Blake <ebl...@redhat.com>
> ---
> include/block/nbd.h |  2 ++
> nbd/nbd-internal.h  |  1 +
> nbd/server.c| 62 +
> 3 files changed, 65 insertions(+)
> 
> diff --git a/include/block/nbd.h b/include/block/nbd.h
> index 1072d9e..a5c68df 100644
> --- a/include/block/nbd.h
> +++ b/include/block/nbd.h
> @@ -96,11 +96,13 @@ typedef struct nbd_reply nbd_reply;
> #define NBD_REP_ERR_TLS_REQD    NBD_REP_ERR(5)  /* TLS required */
> #define NBD_REP_ERR_UNKNOWN     NBD_REP_ERR(6)  /* Export unknown */
> #define NBD_REP_ERR_SHUTDOWN    NBD_REP_ERR(7)  /* Server shutting down */
> +#define NBD_REP_ERR_BLOCK_SIZE_REQD NBD_REP_ERR(8) /* Missing OPT_BLOCK_SIZE 
> */
> 
> /* Info types, used during NBD_REP_INFO */
> #define NBD_INFO_EXPORT 0
> #define NBD_INFO_NAME   1
> #define NBD_INFO_DESCRIPTION    2
> +#define NBD_INFO_BLOCK_SIZE 3
> 
> /* Request flags, sent from client to server during transmission phase */
> #define NBD_CMD_FLAG_FUA    (1 << 0) /* 'force unit access' during write 
> */
> diff --git a/nbd/nbd-internal.h b/nbd/nbd-internal.h
> index 1935102..1354182 100644
> --- a/nbd/nbd-internal.h
> +++ b/nbd/nbd-internal.h
> @@ -85,6 +85,7 @@
> #define NBD_OPT_STARTTLS    (5)
> #define NBD_OPT_INFO        (6)
> #define NBD_OPT_GO  (7)
> +#define NBD_OPT_BLOCK_SIZE  (9)
> 
> /* NBD errors are based on errno numbers, so there is a 1:1 mapping,
>  * but only a limited set of errno values is specified in the protocol.
> diff --git a/nbd/server.c b/nbd/server.c
> index 563afb2..86d1e2d 100644
> --- a/nbd/server.c
> +++ b/nbd/server.c
> @@ -83,6 +83,7 @@ struct NBDClient {
> void (*close)(NBDClient *client);
> 
> bool no_zeroes;
> +bool block_size;
> NBDExport *exp;
> QCryptoTLSCreds *tlscreds;
> char *tlsaclname;
> @@ -346,6 +347,7 @@ static int nbd_negotiate_handle_info(NBDClient *client, 
> uint32_t length,
> uint16_t type;
> uint64_t size;
> uint16_t flags;
> +uint32_t block;
> 
> /* Client sends:
> [20 ..  xx]   export name (length bytes)
> @@ -391,6 +393,57 @@ static int nbd_negotiate_handle_info(NBDClient *client, 
> uint32_t length,
> }
> 
> rc = nbd_negotiate_send_rep_len(client->ioc, NBD_REP_INFO, opt,
> +sizeof(type) + 3 * sizeof(block));
> +if (rc < 0) {
> +return rc;
> +}
> +
> +type = cpu_to_be16(NBD_INFO_BLOCK_SIZE);
> +if (nbd_negotiate_write(client->ioc, &type, sizeof(type)) !=
> +sizeof(type)) {
> +LOG("write failed");
> +return -EIO;
> +}
> +/* minimum - Always 1, because we use blk_pread().
> + * TODO: Advertise 512 if guest used NBD_OPT_BLOCK_SIZE? */
> +block = cpu_to_be32(1);
> +if (nbd_negotiate_write(client->ioc, &block, sizeof(block)) !=
> +sizeof(block)) {
> +LOG("write failed");
> +return -EIO;
> +}
> +/* preferred - At least 4096, but larger as appropriate. */
> +  

Re: [Qemu-devel] [PATCH v3 42/44] nbd: Implement NBD_CMD_WRITE_ZEROES on client

2016-04-25 Thread Alex Bligh

On 23 Apr 2016, at 00:40, Eric Blake <ebl...@redhat.com> wrote:

> Upstream NBD protocol recently added the ability to efficiently
> write zeroes without having to send the zeroes over the wire,
> along with a flag to control whether the client wants a hole.
> 
> The generic block code takes care of falling back to the obvious
> write of lots of zeroes if we return -ENOTSUP because the server
> does not have WRITE_ZEROES.
> 
> Signed-off-by: Eric Blake <ebl...@redhat.com>
Reviewed-by: Alex Bligh <a...@alex.org.uk>
> 
> ---
> v3: rebase, tell block layer about our support
> ---
> block/nbd-client.h |  2 ++
> block/nbd-client.c | 34 ++
> block/nbd.c| 24 
> 3 files changed, 60 insertions(+)
> 
> diff --git a/block/nbd-client.h b/block/nbd-client.h
> index 0867147..07630ab 100644
> --- a/block/nbd-client.h
> +++ b/block/nbd-client.h
> @@ -46,6 +46,8 @@ void nbd_client_close(BlockDriverState *bs);
> int nbd_client_co_discard(BlockDriverState *bs, int64_t sector_num,
>   int nb_sectors);
> int nbd_client_co_flush(BlockDriverState *bs);
> +int nbd_client_co_write_zeroes(BlockDriverState *bs, int64_t sector_num,
> +   int nb_sectors, int *flags);
> int nbd_client_co_writev(BlockDriverState *bs, int64_t sector_num,
>  int nb_sectors, QEMUIOVector *qiov, int *flags);
> int nbd_client_co_readv(BlockDriverState *bs, int64_t sector_num,
> diff --git a/block/nbd-client.c b/block/nbd-client.c
> index f20219b..2b6ac27 100644
> --- a/block/nbd-client.c
> +++ b/block/nbd-client.c
> @@ -291,6 +291,40 @@ int nbd_client_co_readv(BlockDriverState *bs, int64_t 
> sector_num,
> return nbd_co_readv_1(bs, sector_num, nb_sectors, qiov, offset);
> }
> 
> +int nbd_client_co_write_zeroes(BlockDriverState *bs, int64_t sector_num,
> +   int nb_sectors, int *flags)
> +{
> +ssize_t ret;
> +NbdClientSession *client = nbd_get_client_session(bs);
> +struct nbd_request request = { .type = NBD_CMD_WRITE_ZEROES };
> +struct nbd_reply reply;
> +
> +if (!(client->info.flags & NBD_FLAG_SEND_WRITE_ZEROES)) {
> +return -ENOTSUP;
> +}
> +
> +if ((*flags & BDRV_REQ_FUA) && (client->info.flags & NBD_FLAG_SEND_FUA)) 
> {
> +*flags &= ~BDRV_REQ_FUA;
> +request.flags |= NBD_CMD_FLAG_FUA;
> +}
> +if (!(*flags & BDRV_REQ_MAY_UNMAP)) {
> +request.flags |= NBD_CMD_FLAG_NO_HOLE;
> +}
> +
> +request.from = sector_num * 512;
> +request.len = nb_sectors * 512;
> +
> +nbd_coroutine_start(client, &request);
> +ret = nbd_co_send_request(bs, &request, NULL, 0);
> +if (ret < 0) {
> +reply.error = -ret;
> +} else {
> +nbd_co_receive_reply(client, &request, &reply, NULL, 0);
> +}
> +nbd_coroutine_end(client, &request);
> +return -reply.error;
> +}
> +
> int nbd_client_co_writev(BlockDriverState *bs, int64_t sector_num,
>  int nb_sectors, QEMUIOVector *qiov, int *flags)
> {
> diff --git a/block/nbd.c b/block/nbd.c
> index 34db83e..5172039 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -355,6 +355,26 @@ static int nbd_co_readv(BlockDriverState *bs, int64_t 
> sector_num,
> return nbd_client_co_readv(bs, sector_num, nb_sectors, qiov);
> }
> 
> +static int nbd_co_write_zeroes(BlockDriverState *bs, int64_t sector_num,
> +   int nb_sectors, BdrvRequestFlags orig_flags)
> +{
> +int flags = orig_flags;
> +int ret;
> +
> +ret = nbd_client_co_write_zeroes(bs, sector_num, nb_sectors, &flags);
> +if (ret < 0) {
> +return ret;
> +}
> +
> +/* The flag wasn't sent to the server, so we need to emulate it with an
> + * explicit flush */
> +if (flags & BDRV_REQ_FUA) {
> +ret = nbd_client_co_flush(bs);
> +}
> +
> +return ret;
> +}
> +
> static int nbd_co_writev_flags(BlockDriverState *bs, int64_t sector_num,
>int nb_sectors, QEMUIOVector *qiov, int flags)
> {
> @@ -388,6 +408,7 @@ static int nbd_co_flush(BlockDriverState *bs)
> static void nbd_refresh_limits(BlockDriverState *bs, Error **errp)
> {
> bs->bl.max_discard = UINT32_MAX >> BDRV_SECTOR_BITS;
> +bs->bl.max_write_zeroes = UINT32_MAX >> BDRV_SECTOR_BITS;
> bs->bl.max_transfer_length = UINT32_MAX >> BDRV_SECTOR_BITS;
> }
> 
> @@ -476,6 +497,7 @@ static BlockDriver bdrv_nbd = {
> .bdrv_parse_filename= nbd_parse_filename,
> .bdrv_file_open = nbd_

Re: [Qemu-devel] [PATCH v3 41/44] nbd: Implement NBD_CMD_WRITE_ZEROES on server

2016-04-25 Thread Alex Bligh

On 23 Apr 2016, at 00:40, Eric Blake <ebl...@redhat.com> wrote:

> Upstream NBD protocol recently added the ability to efficiently
> write zeroes without having to send the zeroes over the wire,
> along with a flag to control whether the client wants a hole.
> 
> Signed-off-by: Eric Blake <ebl...@redhat.com>

Reviewed-by: Alex Bligh <a...@alex.org.uk>

> 
> ---
> v3: abandon NBD_CMD_CLOSE extension, rebase to use blk_pwrite_zeroes
> ---
> include/block/nbd.h |  7 +--
> nbd/server.c| 42 --
> 2 files changed, 45 insertions(+), 4 deletions(-)
> 
> diff --git a/include/block/nbd.h b/include/block/nbd.h
> index 05c0e48..1072d9e 100644
> --- a/include/block/nbd.h
> +++ b/include/block/nbd.h
> @@ -70,6 +70,7 @@ typedef struct nbd_reply nbd_reply;
> #define NBD_FLAG_SEND_FUA   (1 << 3)    /* Send FUA (Force Unit 
> Access) */
> #define NBD_FLAG_ROTATIONAL (1 << 4)    /* Use elevator algorithm - 
> rotational media */
> #define NBD_FLAG_SEND_TRIM  (1 << 5)    /* Send TRIM (discard) */
> +#define NBD_FLAG_SEND_WRITE_ZEROES (1 << 6) /* Send WRITE_ZEROES */
> 
> /* New-style handshake (global) flags, sent from server to client, and
>control what will happen during handshake phase. */
> @@ -102,7 +103,8 @@ typedef struct nbd_reply nbd_reply;
> #define NBD_INFO_DESCRIPTION    2
> 
> /* Request flags, sent from client to server during transmission phase */
> -#define NBD_CMD_FLAG_FUA    (1 << 0)
> +#define NBD_CMD_FLAG_FUA    (1 << 0) /* 'force unit access' during write 
> */
> +#define NBD_CMD_FLAG_NO_HOLE    (1 << 1) /* don't punch hole on zero run */
> 
> /* Supported request types */
> enum {
> @@ -110,7 +112,8 @@ enum {
> NBD_CMD_WRITE = 1,
> NBD_CMD_DISC = 2,
> NBD_CMD_FLUSH = 3,
> -NBD_CMD_TRIM = 4
> +NBD_CMD_TRIM = 4,
> +NBD_CMD_WRITE_ZEROES = 5,
> };
> 
> #define NBD_DEFAULT_PORT  10809
> diff --git a/nbd/server.c b/nbd/server.c
> index 1edb5f3..563afb2 100644
> --- a/nbd/server.c
> +++ b/nbd/server.c
> @@ -689,7 +689,8 @@ static coroutine_fn int nbd_negotiate(NBDClientNewData 
> *data)
> char buf[8 + 8 + 8 + 128];
> int rc;
> const uint16_t myflags = (NBD_FLAG_HAS_FLAGS | NBD_FLAG_SEND_TRIM |
> -  NBD_FLAG_SEND_FLUSH | NBD_FLAG_SEND_FUA);
> +  NBD_FLAG_SEND_FLUSH | NBD_FLAG_SEND_FUA |
> +  NBD_FLAG_SEND_WRITE_ZEROES);
> bool oldStyle;
> size_t len;
> 
> @@ -1199,11 +1200,17 @@ static ssize_t nbd_co_receive_request(NBDRequest *req,
> rc = -EINVAL;
> goto out;
> }
> -if (request->flags & ~NBD_CMD_FLAG_FUA) {
> +if (request->flags & ~(NBD_CMD_FLAG_FUA | NBD_CMD_FLAG_NO_HOLE)) {
> LOG("unsupported flags (got 0x%x)", request->flags);
> rc = -EINVAL;
> goto out;
> }
> +if (request->type != NBD_CMD_WRITE_ZEROES &&
> +(request->flags & NBD_CMD_FLAG_NO_HOLE)) {
> +LOG("unexpected flags (got 0x%x)", request->flags);
> +rc = -EINVAL;
> +goto out;
> +}
> 
> rc = 0;
> 
> @@ -1308,6 +1315,37 @@ static void nbd_trip(void *opaque)
> }
> break;
> 
> +case NBD_CMD_WRITE_ZEROES:
> +TRACE("Request type is WRITE_ZEROES");
> +
> +if (exp->nbdflags & NBD_FLAG_READ_ONLY) {
> +TRACE("Server is read-only, return error");
> +reply.error = EROFS;
> +goto error_reply;
> +}
> +
> +TRACE("Writing to device");
> +
> +flags = 0;
> +if (request.flags & NBD_CMD_FLAG_FUA) {
> +flags |= BDRV_REQ_FUA;
> +}
> +if (!(request.flags & NBD_CMD_FLAG_NO_HOLE)) {
> +flags |= BDRV_REQ_MAY_UNMAP;
> +}
> +ret = blk_pwrite_zeroes(exp->blk, request.from + exp->dev_offset,
> +request.len, flags);
> +if (ret < 0) {
> +LOG("writing to file failed");
> +reply.error = -ret;
> +goto error_reply;
> +}
> +
> +if (nbd_co_send_reply(req, &reply, 0) < 0) {
> +goto out;
> +}
> +break;
> +
> case NBD_CMD_DISC:
> /* unreachable, thanks to special case in nbd_co_receive_request() */
> abort();
> -- 
> 2.5.5
> 
> 

-- 
Alex Bligh
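
As a sketch of the client side of this command, here is one way to encode an
NBD_CMD_WRITE_ZEROES request on the wire, using the layout quoted elsewhere in
this thread (32-bit magic, 16-bit flags, 16-bit type, handle, offset, length;
all big-endian). The helpers are illustrative only, not QEMU's encoder:

    #include <stdint.h>

    #define NBD_REQUEST_MAGIC    0x25609513
    #define NBD_CMD_WRITE_ZEROES 5
    #define NBD_CMD_FLAG_NO_HOLE (1 << 1)

    static void put_be(uint8_t *p, uint64_t v, int bytes)
    {
        for (int i = 0; i < bytes; i++) {
            p[i] = (uint8_t)(v >> (8 * (bytes - 1 - i)));
        }
    }

    /* Fill a 28-byte request: magic, flags, type, handle, offset, length */
    static void encode_write_zeroes(uint8_t req[28], uint64_t handle,
                                    uint64_t offset, uint32_t length,
                                    int may_punch_hole)
    {
        put_be(req + 0, NBD_REQUEST_MAGIC, 4);
        put_be(req + 4, may_punch_hole ? 0 : NBD_CMD_FLAG_NO_HOLE, 2);
        put_be(req + 6, NBD_CMD_WRITE_ZEROES, 2);
        put_be(req + 8, handle, 8);
        put_be(req + 16, offset, 8);
        put_be(req + 24, length, 4);
    }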







Re: [Qemu-devel] [PATCH v3 40/44] nbd: Implement NBD_OPT_GO on client

2016-04-25 Thread Alex Bligh

On 23 Apr 2016, at 00:40, Eric Blake <ebl...@redhat.com> wrote:

> NBD_OPT_EXPORT_NAME is lousy: it doesn't have any sane error
> reporting.  Upstream NBD recently added NBD_OPT_GO as the
> improved version of the option that does what we want: it
> reports sane errors on failures (including when a server
> requires TLS but does not have NBD_OPT_GO!), and on success
> it provides at least as much info as NBD_OPT_EXPORT_NAME sends.
> 
> Signed-off-by: Eric Blake <ebl...@redhat.com>
> 

Reviewed-by: Alex Bligh <a...@alex.org.uk>

> ---
> v3: revamp to match latest version of NBD protocol
> ---
> nbd/nbd-internal.h |   3 ++
> nbd/client.c   | 120 -
> 2 files changed, 121 insertions(+), 2 deletions(-)
> 
> diff --git a/nbd/nbd-internal.h b/nbd/nbd-internal.h
> index c597bb8..1935102 100644
> --- a/nbd/nbd-internal.h
> +++ b/nbd/nbd-internal.h
> @@ -55,8 +55,11 @@
>  * https://github.com/yoe/nbd/blob/master/doc/proto.md
>  */
> 
> +/* Size of all NBD_OPT_*, without payload */
> #define NBD_REQUEST_SIZE(4 + 2 + 2 + 8 + 8 + 4)
> +/* Size of all NBD_REP_* sent in answer to most NBD_OPT_*, without payload */
> #define NBD_REPLY_SIZE  (4 + 4 + 8)
> +
> #define NBD_REQUEST_MAGIC   0x25609513
> #define NBD_REPLY_MAGIC 0x67446698
> #define NBD_OPTS_MAGIC  0x49484156454F5054LL
> diff --git a/nbd/client.c b/nbd/client.c
> index 89fa2c3..dac4f29 100644
> --- a/nbd/client.c
> +++ b/nbd/client.c
> @@ -222,6 +222,11 @@ static int nbd_handle_reply_err(QIOChannel *ioc, nbd_opt_reply *reply,
>reply->option);
> break;
> 
> +case NBD_REP_ERR_UNKNOWN:
> +error_setg(errp, "Requested export not available for option %" 
> PRIx32,
> +   reply->option);
> +break;
> +
> case NBD_REP_ERR_SHUTDOWN:
> error_setg(errp, "Server shutting down before option %" PRIx32,
>reply->option);
> @@ -311,6 +316,103 @@ static int nbd_receive_list(QIOChannel *ioc, const char *want, Error **errp)
> }
> 
> 
> +/* Returns -1 if NBD_OPT_GO proves the export @wantname cannot be
> + * used, 0 if NBD_OPT_GO is unsupported (fall back to NBD_OPT_LIST and
> + * NBD_OPT_EXPORT_NAME in that case), and > 0 if the export is good to
> + * go (with @size and @flags populated). */
> +static int nbd_opt_go(QIOChannel *ioc, const char *wantname,
> +  NbdExportInfo *info, Error **errp)
> +{
> +nbd_opt_reply reply;
> +uint32_t len;
> +uint16_t type;
> +int error;
> +
> +/* The protocol requires that the server send NBD_INFO_EXPORT with
> + * a non-zero flags (at least NBD_FLAG_HAS_FLAGS must be set); so
> + * flags still 0 is a witness of a broken server. */
> +info->flags = 0;
> +
> +TRACE("Attempting NBD_OPT_GO for export '%s'", wantname);
> +if (nbd_send_option_request(ioc, NBD_OPT_GO, -1, wantname, errp) < 0) {
> +return -1;
> +}
> +
> +TRACE("Reading export info");
> +while (1) {
> +if (nbd_receive_option_reply(ioc, NBD_OPT_GO, &reply, errp) < 0) {
> +return -1;
> +}
> +error = nbd_handle_reply_err(ioc, &reply, errp);
> +if (error <= 0) {
> +return error;
> +}
> +len = reply.length;
> +
> +if (reply.type == NBD_REP_ACK) {
> +/* Server is done sending info and moved into transmission
> +   phase, but make sure it sent flags */
> +if (len) {
> +error_setg(errp, "server sent invalid NBD_REP_ACK");
> +return -1;
> +}
> +if (!info->flags) {
> +error_setg(errp, "broken server omitted NBD_INFO_EXPORT");
> +return -1;
> +}
> +TRACE("export is good to go");
> +return 1;
> +}
> +if (reply.type != NBD_REP_INFO) {
> +error_setg(errp, "unexpected reply type %" PRIx32 ", expected 
> %x",
> +   reply.type, NBD_REP_INFO);
> +return -1;
> +}
> +if (len < sizeof(type)) {
> +error_setg(errp, "NBD_REP_INFO length %" PRIu32 " is too short",
> +   len);
> +return -1;
> +}
> +if (read_sync(ioc, &type, sizeof(type)) != sizeof(type)) {
> +error_setg(errp, "failed to read info type");
> +return -1;
> +}
> +l
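
The tri-state return contract documented in the comment above implies a
specific caller pattern. A hedged sketch of it (negotiate_export() and
fallback_export_name() are hypothetical stand-ins, not QEMU functions):

    /* Prototypes standing in for the helpers in the patch above. */
    int nbd_opt_go(void *ioc, const char *wantname, void *info, void **errp);
    int fallback_export_name(void *ioc, const char *wantname, void *info,
                             void **errp);

    int negotiate_export(void *ioc, const char *name, void *info, void **errp)
    {
        int ret = nbd_opt_go(ioc, name, info, errp);
        if (ret > 0) {
            return 0;               /* export good to go, info populated */
        }
        if (ret < 0) {
            return -1;              /* export exists but cannot be used */
        }
        /* ret == 0: server predates NBD_OPT_GO, use the older option */
        return fallback_export_name(ioc, name, info, errp);
    }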

Re: [Qemu-devel] [PATCH v3 36/44] nbd: Improve handling of shutdown requests

2016-04-25 Thread Alex Bligh
->error == ESHUTDOWN) {
> +LOG("server shutting down");
> +return -EINVAL;
> +}
> return 0;
> }
> 
> diff --git a/nbd/server.c b/nbd/server.c
> index dadc928..fa6a994 100644
> --- a/nbd/server.c
> +++ b/nbd/server.c
> @@ -39,6 +39,8 @@ static int system_errno_to_nbd_errno(int err)
> case EFBIG:
> case ENOSPC:
> return NBD_ENOSPC;
> +case ESHUTDOWN:
> +return NBD_ESHUTDOWN;
> case EINVAL:
> default:
> return NBD_EINVAL;
> @@ -484,6 +486,10 @@ static int nbd_negotiate_options(NBDClient *client)
> if (ret < 0) {
> return ret;
> }
> +/* Let the client keep trying, unless they asked to quit */
> +if (clientflags == NBD_OPT_ABORT) {

OK, that's totally confusing. clientflags is not the client flags: it is the
NBD option ID, the four bytes after the NBD option magic, which would be the
client flags if we were doing oldstyle negotiation, not newstyle negotiation.

Except:

> +return -EINVAL;
> +}
> break;
> }
> } else if (fixedNewstyle) {

So the above is for NewStyle (not fixedNewStyle)?

In which case more than one option isn't even supported, so what's the
stuff purporting to handle TLS doing there?

Confused ...

> @@ -496,7 +502,15 @@ static int nbd_negotiate_options(NBDClient *client)
> break;
> 
> case NBD_OPT_ABORT:
> -return -EINVAL;
> +    /* NBD spec says we must reply before disconnecting,
> + * but that we must also tolerate guests that don't
> + * wait for our reply. */
> +ret = nbd_negotiate_send_rep(client->ioc, NBD_REP_ACK,
> + clientflags);
> +if (!ret) {
> +ret = -EINVAL;
> +}
> +return ret;
> 
> case NBD_OPT_EXPORT_NAME:
> return nbd_negotiate_handle_export_name(client, length);
> -- 
> 2.5.5
> 
> 

-- 
Alex Bligh
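
For the client side of the same idea, the wire error values have to be mapped
back to host errno values. A minimal sketch of the reverse of the server's
system_errno_to_nbd_errno() above; the NBD_E* wire constants are an assumption
taken from the protocol spec (they track Linux errno numbers, e.g. ESHUTDOWN
is 108), not from the patch itself:

    #include <errno.h>

    #define NBD_EPERM       1
    #define NBD_EIO         5
    #define NBD_ENOMEM     12
    #define NBD_EINVAL     22
    #define NBD_ENOSPC     28
    #define NBD_ESHUTDOWN 108

    static int nbd_errno_to_system_errno(int err)
    {
        switch (err) {
        case NBD_EPERM:     return EPERM;
        case NBD_EIO:       return EIO;
        case NBD_ENOMEM:    return ENOMEM;
        case NBD_ENOSPC:    return ENOSPC;
        case NBD_ESHUTDOWN: return ESHUTDOWN;
        case NBD_EINVAL:
        default:            return EINVAL;
        }
    }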







Re: [Qemu-devel] [PATCH v3 32/44] nbd: Share common option-sending code in client

2016-04-25 Thread Alex Bligh

On 23 Apr 2016, at 00:40, Eric Blake <ebl...@redhat.com> wrote:

> Rather than open-coding each option request, it's easier to
> have common helper functions do the work.  That in turn requires
> having convenient packed types for handling option requests
> and replies.
> 
> Signed-off-by: Eric Blake <ebl...@redhat.com>
> Reviewed-by: Alex Bligh <a...@alex.org.uk>
> 
> ---
> v3: rebase, tweak a debug message
> ---
> include/block/nbd.h |  29 +-
> nbd/nbd-internal.h  |   2 +-
> nbd/client.c| 250 ++--
> 3 files changed, 129 insertions(+), 152 deletions(-)
> 
> diff --git a/include/block/nbd.h b/include/block/nbd.h
> index f4ae86c..5227ec6 100644
> --- a/include/block/nbd.h
> +++ b/include/block/nbd.h
> @@ -26,20 +26,41 @@
> #include "io/channel-socket.h"
> #include "crypto/tlscreds.h"
> 
> +/* Handshake phase structs */
> +
> +struct nbd_option {
> +uint64_t magic; /* NBD_OPTS_MAGIC */
> +uint32_t option; /* NBD_OPT_* */
> +uint32_t length;
> +} QEMU_PACKED;
> +typedef struct nbd_option nbd_option;
> +
> +struct nbd_opt_reply {
> +uint64_t magic; /* NBD_REP_MAGIC */
> +uint32_t option; /* NBD_OPT_* */
> +uint32_t type; /* NBD_REP_* */
> +uint32_t length;
> +} QEMU_PACKED;
> +typedef struct nbd_opt_reply nbd_opt_reply;
> +
> +/* Transmission phase structs */
> +
> struct nbd_request {
> -uint32_t magic;
> -uint16_t flags;
> -uint16_t type;
> +uint32_t magic; /* NBD_REQUEST_MAGIC */
> +uint16_t flags; /* NBD_CMD_FLAG_* */
> +uint16_t type; /* NBD_CMD_* */
> uint64_t handle;
> uint64_t from;
> uint32_t len;
> } QEMU_PACKED;
> +typedef struct nbd_request nbd_request;
> 
> struct nbd_reply {
> -uint32_t magic;
> +uint32_t magic; /* NBD_REPLY_MAGIC */
> uint32_t error;
> uint64_t handle;
> } QEMU_PACKED;
> +typedef struct nbd_reply nbd_reply;
> 
> /* Transmission (export) flags: sent from server to client during handshake,
>but describe what will happen during transmission */
> diff --git a/nbd/nbd-internal.h b/nbd/nbd-internal.h
> index d0108a1..95069db 100644
> --- a/nbd/nbd-internal.h
> +++ b/nbd/nbd-internal.h
> @@ -61,7 +61,7 @@
> #define NBD_REPLY_MAGIC 0x67446698
> #define NBD_OPTS_MAGIC  0x49484156454F5054LL
> #define NBD_CLIENT_MAGIC0x420281861253LL
> -#define NBD_REP_MAGIC   0x3e889045565a9LL
> +#define NBD_REP_MAGIC   0x0003e889045565a9LL
> 
> #define NBD_SET_SOCK_IO(0xab, 0)
> #define NBD_SET_BLKSIZE _IO(0xab, 1)
> diff --git a/nbd/client.c b/nbd/client.c
> index a9e173a..6626fa8 100644
> --- a/nbd/client.c
> +++ b/nbd/client.c
> @@ -75,64 +75,123 @@ static QTAILQ_HEAD(, NBDExport) exports = QTAILQ_HEAD_INITIALIZER(exports);
> 
> */
> 
> +/* Send an option request. Return 0 if successful, -1 with errp set if
> + * it is impossible to continue. */
> +static int nbd_send_option_request(QIOChannel *ioc, uint32_t opt,
> +   uint32_t len, const char *data,
> +   Error **errp)
> +{
> +nbd_option req;
> +QEMU_BUILD_BUG_ON(sizeof(req) != 16);
> 
> -/* If type represents success, return 1 without further action.
> - * If type represents an error reply, consume the rest of the packet on ioc.
> - * Then return 0 for unsupported (so the client can fall back to
> - * other approaches), or -1 with errp set for other errors.
> +if (len == -1) {
> +req.length = len = strlen(data);
> +}
> +TRACE("Sending option request %"PRIu32", len %"PRIu32, opt, len);
> +
> +stq_be_p(&req.magic, NBD_OPTS_MAGIC);
> +stl_be_p(&req.option, opt);
> +stl_be_p(&req.length, len);
> +
> +if (write_sync(ioc, &req, sizeof(req)) != sizeof(req)) {
> +error_setg(errp, "Failed to send option request header");
> +return -1;
> +}
> +
> +if (len && write_sync(ioc, (char *) data, len) != len) {
> +error_setg(errp, "Failed to send option request data");
> +return -1;
> +}
> +
> +return 0;
> +}
> +
> +/* Receive the header of an option reply, which should match the given
> + * opt.  Read through the length field, but NOT the length bytes of
> + * payload. Return 0 if successful, -1 with errp set if it is
> + * impossible to continue. */
> +static int nbd_receive_option_reply(QIOChannel *ioc, uint32_t opt,
> +nbd_opt_reply *reply, Error **errp)
> +{
> +QEMU_BUILD_BUG_ON(sizeof(*reply) != 20);
> +if (read_sync(
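
One detail worth calling out in nbd_send_option_request() above: passing -1 as
the length makes the helper take strlen(data) itself, which is handy for
NUL-terminated export names. A hedged usage sketch (prototype repeated from
the patch; the literal 7 for NBD_OPT_GO is an assumption from the protocol
spec, not quoted above):

    #include <stdint.h>

    typedef struct QIOChannel QIOChannel;
    typedef struct Error Error;

    /* Prototype as in the patch above. */
    int nbd_send_option_request(QIOChannel *ioc, uint32_t opt, uint32_t len,
                                const char *data, Error **errp);

    void send_go(QIOChannel *ioc, const char *wantname, Error **errp)
    {
        /* len == -1 (i.e. UINT32_MAX) asks the helper to use strlen(data);
         * this puts exactly the same bytes on the wire as passing
         * strlen(wantname) explicitly. */
        (void) nbd_send_option_request(ioc, 7 /* NBD_OPT_GO */, -1,
                                       wantname, errp);
    }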

Re: [Qemu-devel] [PATCH v3 31/44] nbd: Share common reply-sending code in server

2016-04-25 Thread Alex Bligh

On 23 Apr 2016, at 00:40, Eric Blake <ebl...@redhat.com> wrote:

> Rather than open-coding NBD_REP_SERVER, reuse the code we
> already have by adding a length parameter.  Additionally,
> the refactoring will make adding NBD_OPT_GO in a later patch
> easier.
> 
> Signed-off-by: Eric Blake <ebl...@redhat.com>
Reviewed-by: Alex Bligh <a...@alex.org.uk>
> 
> ---
> v3: rebase to changes earlier in series
> ---
> nbd/server.c | 48 +++-
> 1 file changed, 23 insertions(+), 25 deletions(-)
> 
> diff --git a/nbd/server.c b/nbd/server.c
> index 1d30b6d..4435d37 100644
> --- a/nbd/server.c
> +++ b/nbd/server.c
> @@ -195,12 +195,15 @@ static ssize_t nbd_negotiate_drop_sync(QIOChannel *ioc, size_t size)
> 
> */
> 
> -static int nbd_negotiate_send_rep(QIOChannel *ioc, uint32_t type, uint32_t opt)
> +/* Send a reply header, including length, but no payload.
> + * Return -errno to kill connection, 0 to continue negotiation. */
> +static int nbd_negotiate_send_rep_len(QIOChannel *ioc, uint32_t type,
> +  uint32_t opt, uint32_t len)
> {
> uint64_t magic;
> -uint32_t len;
> 
> -TRACE("Reply opt=%" PRIx32 " type=%" PRIx32, type, opt);
> +TRACE("Reply opt=%" PRIx32 " type=%" PRIx32 " len=%" PRIu32,
> +  type, opt, len);
> 
> magic = cpu_to_be64(NBD_REP_MAGIC);
> if (nbd_negotiate_write(ioc, &magic, sizeof(magic)) != sizeof(magic)) {
> @@ -217,7 +220,7 @@ static int nbd_negotiate_send_rep(QIOChannel *ioc, uint32_t type, uint32_t opt)
> LOG("write failed (rep type)");
> return -EINVAL;
> }
> -len = cpu_to_be32(0);
> +len = cpu_to_be32(len);
> if (nbd_negotiate_write(ioc, &len, sizeof(len)) != sizeof(len)) {
> LOG("write failed (rep data length)");
> return -EINVAL;
> @@ -225,37 +228,32 @@ static int nbd_negotiate_send_rep(QIOChannel *ioc, uint32_t type, uint32_t opt)
> return 0;
> }
> 
> +/* Send a reply header with default 0 length.
> + * Return -errno to kill connection, 0 to continue negotiation. */
> +static int nbd_negotiate_send_rep(QIOChannel *ioc, uint32_t type, uint32_t opt)
> +{
> +return nbd_negotiate_send_rep_len(ioc, type, opt, 0);
> +}
> +
> +/* Send an NBD_REP_SERVER reply to NBD_OPT_LIST, including payload.
> + * Return -errno to kill connection, 0 to continue negotiation. */
> static int nbd_negotiate_send_rep_list(QIOChannel *ioc, NBDExport *exp)
> {
> -uint64_t magic;
> size_t name_len, desc_len;
> -uint32_t opt, type, len;
> +uint32_t len;
> const char *name = exp->name ? exp->name : "";
> const char *desc = exp->description ? exp->description : "";
> +int rc;
> 
> TRACE("Advertising export name '%s' description '%s'", name, desc);
> name_len = strlen(name);
> desc_len = strlen(desc);
> -magic = cpu_to_be64(NBD_REP_MAGIC);
> -if (nbd_negotiate_write(ioc, &magic, sizeof(magic)) != sizeof(magic)) {
> -LOG("write failed (magic)");
> -return -EINVAL;
> - }
> -opt = cpu_to_be32(NBD_OPT_LIST);
> -if (nbd_negotiate_write(ioc, &opt, sizeof(opt)) != sizeof(opt)) {
> -LOG("write failed (opt)");
> -return -EINVAL;
> -}
> -type = cpu_to_be32(NBD_REP_SERVER);
> -if (nbd_negotiate_write(ioc, &type, sizeof(type)) != sizeof(type)) {
> -LOG("write failed (reply type)");
> -return -EINVAL;
> -}
> -len = cpu_to_be32(name_len + desc_len + sizeof(len));
> -if (nbd_negotiate_write(ioc, &len, sizeof(len)) != sizeof(len)) {
> -LOG("write failed (length)");
> -return -EINVAL;
> +len = name_len + desc_len + sizeof(len);
> +rc = nbd_negotiate_send_rep_len(ioc, NBD_REP_SERVER, NBD_OPT_LIST, len);
> +if (rc < 0) {
> +return rc;
> }
> +
> len = cpu_to_be32(name_len);
> if (nbd_negotiate_write(ioc, &len, sizeof(len)) != sizeof(len)) {
> LOG("write failed (name length)");
> -- 
> 2.5.5
> 
> 

-- 
Alex Bligh
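
The length computation above (name_len + desc_len + sizeof(len)) reflects the
NBD_REP_SERVER payload layout: the payload itself starts with a 4-byte name
length, then the name, then the description filling the rest. A standalone
sketch of building that payload (illustrative only, not QEMU's code):

    #include <stdint.h>
    #include <string.h>

    /* Returns the payload length; out must have room for
     * 4 + strlen(name) + strlen(desc) bytes. */
    static uint32_t build_rep_server_payload(uint8_t *out, const char *name,
                                             const char *desc)
    {
        uint32_t name_len = (uint32_t)strlen(name);
        uint32_t desc_len = (uint32_t)strlen(desc);

        /* 4-byte big-endian name length, then the two strings */
        out[0] = (uint8_t)(name_len >> 24);
        out[1] = (uint8_t)(name_len >> 16);
        out[2] = (uint8_t)(name_len >> 8);
        out[3] = (uint8_t)name_len;
        memcpy(out + 4, name, name_len);
        memcpy(out + 4 + name_len, desc, desc_len);
        return 4 + name_len + desc_len;
    }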







Re: [Qemu-devel] [PATCH v3 30/44] nbd: Treat flags vs. command type as separate fields

2016-04-25 Thread Alex Bligh

On 23 Apr 2016, at 00:40, Eric Blake <ebl...@redhat.com> wrote:

> Current upstream NBD documents that requests have a 16-bit flags,
> followed by a 16-bit type integer; although older versions mentioned
> only a 32-bit field with masking to find flags.  Since the protocol
> is in network order (big-endian over the wire), the ABI is unchanged;
> but dealing with the flags as a separate field rather than masking
> will make it easier to add support for upcoming NBD extensions that
> increase the number of both flags and commands.
> 
> Improve some comments in nbd.h based on the current upstream
> NBD protocol (https://github.com/yoe/nbd/blob/master/doc/proto.md),
> and touch some nearby code to keep checkpatch.pl happy.
> 
> Signed-off-by: Eric Blake <ebl...@redhat.com>
Reviewed-by: Alex Bligh <a...@alex.org.uk>
> 
> ---
> v3: rebase to other changes earlier in series
> ---
> include/block/nbd.h | 18 --
> nbd/nbd-internal.h  |  4 ++--
> block/nbd-client.c  |  9 +++--
> nbd/client.c| 17 ++---
> nbd/server.c| 51 ++-
> 5 files changed, 53 insertions(+), 46 deletions(-)
> 
> diff --git a/include/block/nbd.h b/include/block/nbd.h
> index 2c753cc..f4ae86c 100644
> --- a/include/block/nbd.h
> +++ b/include/block/nbd.h
> @@ -1,4 +1,5 @@
> /*
> + *  Copyright (C) 2016 Red Hat, Inc.
>  *  Copyright (C) 2005  Anthony Liguori <anth...@codemonkey.ws>
>  *
>  *  Network Block Device
> @@ -27,7 +28,8 @@
> 
> struct nbd_request {
> uint32_t magic;
> -uint32_t type;
> +uint16_t flags;
> +uint16_t type;
> uint64_t handle;
> uint64_t from;
> uint32_t len;
> @@ -39,6 +41,8 @@ struct nbd_reply {
> uint64_t handle;
> } QEMU_PACKED;
> 
> +/* Transmission (export) flags: sent from server to client during handshake,
> +   but describe what will happen during transmission */
> #define NBD_FLAG_HAS_FLAGS  (1 << 0)/* Flags are there */
> #define NBD_FLAG_READ_ONLY  (1 << 1)/* Device is read-only */
> #define NBD_FLAG_SEND_FLUSH (1 << 2)/* Send FLUSH */
> @@ -46,10 +50,12 @@ struct nbd_reply {
> #define NBD_FLAG_ROTATIONAL (1 << 4)/* Use elevator algorithm - rotational media */
> #define NBD_FLAG_SEND_TRIM  (1 << 5)/* Send TRIM (discard) */
> 
> -/* New-style global flags. */
> +/* New-style handshake (global) flags, sent from server to client, and
> +   control what will happen during handshake phase. */
> #define NBD_FLAG_FIXED_NEWSTYLE (1 << 0)/* Fixed newstyle protocol. */
> 
> -/* New-style client flags. */
> +/* New-style client flags, sent from client to server to control what happens
> +   during handshake phase. */
> #define NBD_FLAG_C_FIXED_NEWSTYLE   (1 << 0)/* Fixed newstyle protocol. */
> 
> /* Reply types. */
> @@ -60,10 +66,10 @@ struct nbd_reply {
> #define NBD_REP_ERR_INVALID ((UINT32_C(1) << 31) | 3) /* Invalid length. */
> #define NBD_REP_ERR_TLS_REQD((UINT32_C(1) << 31) | 5) /* TLS required */
> 
> +/* Request flags, sent from client to server during transmission phase */
> +#define NBD_CMD_FLAG_FUA(1 << 0)
> 
> -#define NBD_CMD_MASK_COMMAND 0xffff
> -#define NBD_CMD_FLAG_FUA (1 << 16)
> -
> +/* Supported request types */
> enum {
> NBD_CMD_READ = 0,
> NBD_CMD_WRITE = 1,
> diff --git a/nbd/nbd-internal.h b/nbd/nbd-internal.h
> index 035ead4..d0108a1 100644
> --- a/nbd/nbd-internal.h
> +++ b/nbd/nbd-internal.h
> @@ -52,10 +52,10 @@
> /* This is all part of the "official" NBD API.
>  *
>  * The most up-to-date documentation is available at:
> - * https://github.com/yoe/nbd/blob/master/doc/proto.txt
> + * https://github.com/yoe/nbd/blob/master/doc/proto.md
>  */
> 
> -#define NBD_REQUEST_SIZE(4 + 4 + 8 + 8 + 4)
> +#define NBD_REQUEST_SIZE(4 + 2 + 2 + 8 + 8 + 4)
> #define NBD_REPLY_SIZE  (4 + 4 + 8)
> #define NBD_REQUEST_MAGIC   0x25609513
> #define NBD_REPLY_MAGIC 0x67446698
> diff --git a/block/nbd-client.c b/block/nbd-client.c
> index 878e879..285025d 100644
> --- a/block/nbd-client.c
> +++ b/block/nbd-client.c
> @@ -1,6 +1,7 @@
> /*
>  * QEMU Block driver for  NBD
>  *
> + * Copyright (C) 2016 Red Hat, Inc.
>  * Copyright (C) 2008 Bull S.A.S.
>  * Author: Laurent Vivier <laurent.viv...@bull.net>
>  *
> @@ -252,7 +253,7 @@ static int nbd_co_writev_1(BlockDriverState *bs, int64_t sector_num,
> 
> if ((*flags & BDRV_REQ_FUA) && (client->nbdflags & NBD_FLAG_SEND_FUA)) {
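
Since the wire format is big-endian, splitting the old 32-bit field into
16-bit flags plus 16-bit type changes no bytes on the wire; only the decode
differs. A small sketch of the two equivalent views (constants taken from the
patch above; the OLD_ prefix is mine, to keep both schemes in one file):

    #include <stdint.h>

    #define NBD_CMD_MASK_COMMAND 0xffff          /* old scheme */
    #define OLD_NBD_CMD_FLAG_FUA (1 << 16)       /* old scheme */
    #define NBD_CMD_FLAG_FUA     (1 << 0)        /* new scheme */

    /* Old decode: one 32-bit field, flags masked out of the top half. */
    static void decode_old(uint32_t word, uint16_t *flags, uint16_t *type)
    {
        *type  = word & NBD_CMD_MASK_COMMAND;
        /* old NBD_CMD_FLAG_FUA (1 << 16) lands on new bit 0 */
        *flags = (uint16_t)(word >> 16);
    }

    /* New decode: read the same four bytes as two 16-bit fields. */
    static void decode_new(const uint8_t p[4], uint16_t *flags, uint16_t *type)
    {
        *flags = (uint16_t)((p[0] << 8) | p[1]);
        *type  = (uint16_t)((p[2] << 8) | p[3]);
    }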

Re: [Qemu-devel] [PATCH v3 29/44] nbd: Avoid magic number for NBD max name size

2016-04-25 Thread Alex Bligh

On 23 Apr 2016, at 00:40, Eric Blake <ebl...@redhat.com> wrote:

> Declare a constant and use that when determining if an export
> name fits within the constraints we are willing to support.
> 
> Note that upstream NBD recently documented that clients MUST
> support export names of 256 bytes (not including trailing NUL),
> and SHOULD support names up to 4096 bytes.  4096 is a bit big
> (we would lose benefits of stack-allocation of a name array),
> and we already have other limits in place (for example, qcow2
> snapshot names are clamped around 1024).  So for now, just
> stick to the required minimum.
> 
> Signed-off-by: Eric Blake <ebl...@redhat.com>
Reviewed-by: Alex Bligh <a...@alex.org.uk>
> 
> ---
> v3: enlarge the limit, and document choice of new value
> ---
> include/block/nbd.h | 6 ++
> nbd/client.c| 2 +-
> nbd/server.c| 4 ++--
> 3 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/include/block/nbd.h b/include/block/nbd.h
> index 3e2d76b..2c753cc 100644
> --- a/include/block/nbd.h
> +++ b/include/block/nbd.h
> @@ -76,6 +76,12 @@ enum {
> 
> /* Maximum size of a single READ/WRITE data buffer */
> #define NBD_MAX_BUFFER_SIZE (32 * 1024 * 1024)
> +/* Maximum size of an export name. The NBD spec requires 256 and
> + * suggests that servers support up to 4096, but we stick to only the
> + * required size so that we can stack-allocate the names, and because
> + * going larger would require an audit of more code to make sure we
> + * aren't overflowing some other buffer. */
> +#define NBD_MAX_NAME_SIZE 256
> 
> ssize_t nbd_wr_syncv(QIOChannel *ioc,
>  struct iovec *iov,
> diff --git a/nbd/client.c b/nbd/client.c
> index 4659df3..b700100 100644
> --- a/nbd/client.c
> +++ b/nbd/client.c
> @@ -210,7 +210,7 @@ static int nbd_receive_list(QIOChannel *ioc, char **name, Error **errp)
> error_setg(errp, "incorrect option name length");
> return -1;
> }
> -if (namelen > 255) {
> +if (namelen > NBD_MAX_NAME_SIZE) {
> error_setg(errp, "export name length too long %" PRIu32, namelen);
> return -1;
> }
> diff --git a/nbd/server.c b/nbd/server.c
> index fa05a73..a20bf8a 100644
> --- a/nbd/server.c
> +++ b/nbd/server.c
> @@ -296,13 +296,13 @@ static int nbd_negotiate_handle_list(NBDClient *client, uint32_t length)
> static int nbd_negotiate_handle_export_name(NBDClient *client, uint32_t length)
> {
> int rc = -EINVAL;
> -char name[256];
> +char name[NBD_MAX_NAME_SIZE + 1];
> 
> /* Client sends:
>     [20 ..  xx]   export name (length bytes)
>  */
> TRACE("Checking length");
> -if (length > 255) {
> +if (length >= sizeof(name)) {
> LOG("Bad length received");
> goto fail;
> }
> -- 
> 2.5.5
> 
> 

-- 
Alex Bligh
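
The switch from 'length > 255' against char name[256] to
'length >= sizeof(name)' against char name[NBD_MAX_NAME_SIZE + 1] is what
actually raises the accepted maximum to the spec's required 256 bytes while
keeping room for the trailing NUL. A compile-and-run illustration of the
boundary case (my own demo, not QEMU code):

    #include <assert.h>

    #define NBD_MAX_NAME_SIZE 256

    int main(void)
    {
        char name[NBD_MAX_NAME_SIZE + 1];   /* 257 bytes */
        unsigned length = 256;              /* spec-required maximum */

        /* new check accepts a 256-byte name: 256 < 257 */
        assert(length < sizeof(name));
        /* old check (length > 255) would have rejected the same name */
        assert(length > 255);
        return 0;
    }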







Re: [Qemu-devel] [PATCH v3 03/44] nbd: Improve server handling of bogus commands

2016-04-25 Thread Alex Bligh
> @@ -1077,14 +1105,6 @@ static void nbd_trip(void *opaque)
> goto error_reply;
> }
> command = request.type & NBD_CMD_MASK_COMMAND;
> -if (command != NBD_CMD_DISC && (request.from + request.len) > exp->size) {
> -LOG("From: %" PRIu64 ", Len: %" PRIu32", Size: %" PRIu64
> -", Offset: %" PRIu64 "\n",
> -request.from, request.len,
> -(uint64_t)exp->size, (uint64_t)exp->dev_offset);
> -LOG("requested operation past EOF--bad client?");
> -goto invalid_request;
> -}
> 
> if (client->closing) {
> /*
> @@ -1151,10 +1171,11 @@ static void nbd_trip(void *opaque)
> goto out;
> }
> break;
> +
> case NBD_CMD_DISC:
> -TRACE("Request type is DISCONNECT");
> -errno = 0;
> -goto out;
> +/* unreachable, thanks to special case in nbd_co_receive_request() */
> +abort();
> +
> case NBD_CMD_FLUSH:
> TRACE("Request type is FLUSH");
> 
> @@ -1182,10 +1203,14 @@ static void nbd_trip(void *opaque)
> break;
> default:
> LOG("invalid request type (%" PRIu32 ") received", request.type);
> -invalid_request:
> reply.error = EINVAL;
> error_reply:
> -if (nbd_co_send_reply(req, &reply, 0) < 0) {
> +/* We must disconnect after replying with an error to
> + * NBD_CMD_READ, since we choose not to send bogus filler
> + * data; likewise after NBD_CMD_WRITE if we did not read the
> + * payload. */
> +if (nbd_co_send_reply(req, &reply, 0) < 0 || command == NBD_CMD_READ ||
> +(command == NBD_CMD_WRITE && !req->complete)) {
> goto out;
> }
> break;
> -- 
> 2.5.5
> 
> 

-- 
Alex Bligh
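
With the blanket EOF check above removed from the top of nbd_trip(), each
command that takes an offset has to validate it itself, and carefully: the
old 'request.from + request.len > exp->size' form can wrap around on a
hostile offset. A hedged sketch of an overflow-safe per-command replacement
(names are mine, not the patch's):

    #include <stdbool.h>
    #include <stdint.h>

    static bool request_in_bounds(uint64_t from, uint32_t len, uint64_t size)
    {
        /* Rearranged so the sum can never overflow a uint64_t. */
        return from <= size && len <= size - from;
    }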







Re: [Qemu-devel] [PATCH v3 08/44] nbd: Add qemu-nbd -D for human-readable description

2016-04-25 Thread Alex Bligh

On 23 Apr 2016, at 00:40, Eric Blake <ebl...@redhat.com> wrote:

> The NBD protocol allows servers to advertise a human-readable
> description alongside an export name during NBD_OPT_LIST.  Add
> an option to pass through the user's string to the NBD client.
> 
> Doing this also makes it easier to test commit 200650d4, which
> is the client counterpart of receiving the description.
> 
> Signed-off-by: Eric Blake <ebl...@redhat.com>

Reviewed-by: Alex Bligh <a...@alex.org.uk>

> ---
> include/block/nbd.h |  1 +
> nbd/nbd-internal.h  |  5 +++--
> nbd/server.c| 34 ++
> qemu-nbd.c  | 12 +++-
> qemu-nbd.texi   |  5 -
> 5 files changed, 45 insertions(+), 12 deletions(-)
> 
> diff --git a/include/block/nbd.h b/include/block/nbd.h
> index 134f117..3e2d76b 100644
> --- a/include/block/nbd.h
> +++ b/include/block/nbd.h
> @@ -107,6 +107,7 @@ BlockBackend *nbd_export_get_blockdev(NBDExport *exp);
> 
> NBDExport *nbd_export_find(const char *name);
> void nbd_export_set_name(NBDExport *exp, const char *name);
> +void nbd_export_set_description(NBDExport *exp, const char *description);
> void nbd_export_close_all(void);
> 
> void nbd_client_new(NBDExport *exp,
> diff --git a/nbd/nbd-internal.h b/nbd/nbd-internal.h
> index 3791535..035ead4 100644
> --- a/nbd/nbd-internal.h
> +++ b/nbd/nbd-internal.h
> @@ -103,9 +103,10 @@ static inline ssize_t read_sync(QIOChannel *ioc, void *buffer, size_t size)
> return nbd_wr_syncv(ioc, &iov, 1, 0, size, true);
> }
> 
> -static inline ssize_t write_sync(QIOChannel *ioc, void *buffer, size_t size)
> +static inline ssize_t write_sync(QIOChannel *ioc, const void *buffer,
> + size_t size)
> {
> -struct iovec iov = { .iov_base = buffer, .iov_len = size };
> +struct iovec iov = { .iov_base = (void *) buffer, .iov_len = size };
> 
> return nbd_wr_syncv(ioc, &iov, 1, 0, size, false);
> }
> diff --git a/nbd/server.c b/nbd/server.c
> index 31fc9cf..aa252a4 100644
> --- a/nbd/server.c
> +++ b/nbd/server.c
> @@ -61,6 +61,7 @@ struct NBDExport {
> 
> BlockBackend *blk;
> char *name;
> +char *description;
> off_t dev_offset;
> off_t size;
> uint16_t nbdflags;
> @@ -128,7 +129,8 @@ static ssize_t nbd_negotiate_read(QIOChannel *ioc, void *buffer, size_t size)
> 
> }
> 
> -static ssize_t nbd_negotiate_write(QIOChannel *ioc, void *buffer, size_t size)
> +static ssize_t nbd_negotiate_write(QIOChannel *ioc, const void *buffer,
> +   size_t size)
> {
> ssize_t ret;
> guint watch;
> @@ -224,11 +226,15 @@ static int nbd_negotiate_send_rep(QIOChannel *ioc, uint32_t type, uint32_t opt)
> 
> static int nbd_negotiate_send_rep_list(QIOChannel *ioc, NBDExport *exp)
> {
> -uint64_t magic, name_len;
> +uint64_t magic;
> +size_t name_len, desc_len;
> uint32_t opt, type, len;
> +const char *name = exp->name ? exp->name : "";
> +const char *desc = exp->description ? exp->description : "";
> 
> -TRACE("Advertising export name '%s'", exp->name ? exp->name : "");
> -name_len = strlen(exp->name);
> +TRACE("Advertising export name '%s' description '%s'", name, desc);
> +name_len = strlen(name);
> +desc_len = strlen(desc);
> magic = cpu_to_be64(NBD_REP_MAGIC);
> if (nbd_negotiate_write(ioc, &magic, sizeof(magic)) != sizeof(magic)) {
> LOG("write failed (magic)");
> @@ -244,18 +250,22 @@ static int nbd_negotiate_send_rep_list(QIOChannel *ioc, NBDExport *exp)
> LOG("write failed (reply type)");
> return -EINVAL;
> }
> -len = cpu_to_be32(name_len + sizeof(len));
> +len = cpu_to_be32(name_len + desc_len + sizeof(len));
> if (nbd_negotiate_write(ioc, &len, sizeof(len)) != sizeof(len)) {
> LOG("write failed (length)");
> return -EINVAL;
> }
> len = cpu_to_be32(name_len);
> if (nbd_negotiate_write(ioc, &len, sizeof(len)) != sizeof(len)) {
> -LOG("write failed (length)");
> +LOG("write failed (name length)");
> return -EINVAL;
> }
> -if (nbd_negotiate_write(ioc, exp->name, name_len) != name_len) {
> -LOG("write failed (buffer)");
> +if (nbd_negotiate_write(ioc, name, name_len) != name_len) {
> +LOG("write failed (name buffer)");
> +return -EINVAL;
> +}
> +if (nbd_negotiate_write(ioc, desc, desc_len) != desc_len) {
> +LOG("write failed (description buffe
