Re: [Nbd] [PATCH][V3] nbd: add multi-connection support

2016-10-06 Thread Alex Bligh

> On 6 Oct 2016, at 11:15, Wouter Verhelst <w...@uter.be> wrote:
> 
>> 
>> 
>>  but I still think it would be helpful if the protocol helped out
>> the end user of the client and refused to negotiate multichannel
>> connections when they are unsafe. How is the end client meant to know
>> whether the back end is not on Linux, not on a block device, done
>> via a Ceph driver etc?
> 
> Well, it isn't. The server, if it provides certain functionality, should
> also provide particular guarantees. If it can't provide those
> guarantees, it should not provide that functionality.
> 
> e.g., if a server runs on a backend with cache coherency issues, it
> should not allow multiple connections to the same device, etc.

Sure. I'm simply saying that the connection flags should say "I can't
support multiple connections to this device" (available at
NBD_OPT_INFO time) rather than erroring out. This is a userspace
protocol issue.

>> I still think it's pretty damn awkward that with a ceph back end
>> (for instance) which would be one of the backends to benefit the
>> most from multichannel connections (as it's inherently parallel),
>> no one has explained how flush could be done safely.
> 
> If ceph doesn't have any way to guarantee that a write is available to
> all readers of a particular device, then it *cannot* be used to map
> block device semantics with multiple channels.

Thinking about it I believe Ceph actually may be able to do that,
it's just harder than a straightforward flush.

> Therefore, it should not
> allow writing to the device from multiple clients, period, unless the
> filesystem (or other thing) making use of the nbd device above the ceph
> layer actually understands how things may go wrong and can take care of
> it.
> 
> As such, I don't think that the problems inherent in using multiple
> connections to a ceph device (which I do not deny) have any place in a
> discussion on how NBD should work in the face of multiple channels with
> a sane/regular backend.

On which note, I am still not convinced that fsync() provides such
semantics on all operating systems and on Linux on non-block devices.
I'm not sure all those backends are 'insane'! However, if the server
could signal lack of support for multiple connections (see above)
my concerns would be significantly reduced. Note this requires no
kernel change (as you pointed out).
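
For concreteness, here is roughly what I have in mind on the client side.
This is a minimal C sketch only; the flag name NBD_FLAG_NO_MULTI_CONN is
made up for illustration and is not in the protocol today:

/* Hypothetical only: NBD_FLAG_NO_MULTI_CONN stands in for a per-export
 * "multiple connections unsafe" transmission flag returned at
 * NBD_OPT_INFO time. */
#include <stdint.h>

#define NBD_FLAG_NO_MULTI_CONN (1 << 9)   /* hypothetical transmission flag */

static int negotiated_connection_count(uint16_t transmission_flags,
                                       int requested_connections)
{
    /* Fall back to a single connection rather than failing the mount. */
    if (transmission_flags & NBD_FLAG_NO_MULTI_CONN)
        return 1;
    return requested_connections;
}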

-- 
Alex Bligh






Re: [Nbd] [PATCH][V3] nbd: add multi-connection support

2016-10-06 Thread Alex Bligh
Wouter,

> On 6 Oct 2016, at 10:04, Wouter Verhelst <w...@uter.be> wrote:
> 
> Hi Alex,
> 
> On Tue, Oct 04, 2016 at 10:35:03AM +0100, Alex Bligh wrote:
>> Wouter,
>>> I see now that it should be closer
>>> to the former; a more useful definition is probably something along the
>>> following lines:
>>> 
>>>   All write commands (that includes NBD_CMD_WRITE and NBD_CMD_TRIM)
>>>   for which a reply was received on the client side prior to the
>> 
>> No, that's wrong as the server has no knowledge of whether the client
>> has actually received them so no way of knowing to which writes that
>> would reply.
> 
> I realise that, but I don't think it's a problem.
> 
> In the current situation, a client could opportunistically send a number
> of write requests immediately followed by a flush and hope for the best.
> However, in that case there is no guarantee that for the write requests
> that the client actually cares about to have hit the disk, a reply
> arrives on the client side before the flush reply arrives. If that
> doesn't happen, that would then mean the client would have to issue
> another flush request, probably at a performance hit.

Sure, but the client currently knows that any write request to which
it has received a reply before it receives the reply to the flush request
has been written to disk. Such a client might simply note whether it
has issued any subsequent write requests.
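
As a sketch (not anything in the current patch), such a client could do
something like the following; the structure and names are illustrative only:

/* Illustrative only: track which write handles had completed before the
 * flush was issued; under the weaker semantic, only those are guaranteed
 * durable by the flush reply. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct pending_write {
    uint64_t handle;
    bool     replied;   /* reply seen before NBD_CMD_FLUSH was sent? */
};

/* Call just before sending NBD_CMD_FLUSH: snapshot the completed writes. */
static size_t writes_covered_by_flush(const struct pending_write *w, size_t n,
                                      uint64_t *covered)
{
    size_t m = 0;
    for (size_t i = 0; i < n; i++)
        if (w[i].replied)
            covered[m++] = w[i].handle;
    return m;
}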

> As I understand Christoph's explanations, currently the Linux kernel
> *doesn't* issue flush requests unless and until the necessary writes
> have already completed (i.e., the reply has been received and processed
> on the client side).

Sure, but it is not the only client.

> Given that, given the issue in the previous
> paragraph, and given the uncertainty introduced with multiple
> connections, I think it is reasonable to say that a client should just
> not assume a flush touches anything except for the writes for which it
> has already received a reply by the time the flush request is sent out.

OK. So you are proposing weakening the semantic for flush (saying that
it is only guaranteed to cover those writes for which the client has
actually received a reply prior to sending the flush, as opposed to
prior to receiving the flush reply). This is based on the view that
the Linux kernel client wouldn't be affected, and if other clients
were affected, their behaviour would be 'somewhat unusual'.

We do have one significant other client out there that uses flush,
namely Qemu. I think we should get a view on whether it would be
affected.

> Those are semantics that are actually useful and can be guaranteed in
> the face of multiple connections. Other semantics can not.

Well, there is another semantic which would work just fine and which also
cures the other problem (synchronisation between channels): flush is only
guaranteed to affect writes issued on the same channel. Then flush would
do the natural thing, i.e. flush all the writes that had been done *on
that channel*.

> It is indeed impossible for a server to know what has been received by
> the client by the time it (the client) sent out the flush request.
> However, the server doesn't need that information, at all. The flush
> request's semantics do not say that any request not covered by the flush
> request itself MUST NOT have hit disk; instead, it just says that there
> is no guarantee on whether or not that is the case. That's fine; all a
> server needs to know is that when it receives a flush, it needs to
> fsync() or some such, and then send the reply. All a *client* needs to
> know is which requests have most definitely hit the disk. In my
> proposal, those are the requests that finished before the flush request
> was sent, and not the requests that finished between that and when the
> flush reply is received. Those are *likely* to also be covered
> (especially on single-connection NBD setups), but in my proposal,
> they're no longer *guaranteed* to be.

I think my objection was more that you were writing mandatory language
for a server's behaviour based on what the client perceives.

What you are saying, from the client's point of view, is that under
your proposed change it can only rely on those writes for which the
reply has been received prior to issuing the flush being persisted
to disk (more might be persisted, but the client can't rely on it).

So far so good.

However, I don't think you can usefully make the guarantee weaker from the
SERVER'S point of view, because it doesn't know how things got reordered.
I.e. it still needs to persist to disk any write that it has completed
when it processes the flush. Yes, the client doesn't get the same guarantee,
but it can't know whether it can be slacker about a particular write
which it has performed.

Re: [Nbd] [PATCH][V3] nbd: add multi-connection support

2016-10-04 Thread Alex Bligh
So
to the client it means that the server has persisted to disk all
those commands to which it has received a reply. However, to the
server, the 'MUST' condition needs to refer to the replies it sent,
not the replies the client receives.

I think "before processing" here really just means "before sending
a reply to the NBD_CMD_FLUSH". I believe the 'before processing'
phrase came from the kernel wording.

> I see now that it should be closer
> to the former; a more useful definition is probably something along the
> following lines:
> 
>All write commands (that includes NBD_CMD_WRITE and NBD_CMD_TRIM)
>for which a reply was received on the client side prior to the

No, that's wrong as the server has no knowledge of whether the client
has actually received them so no way of knowing to which writes that
would reply. It thus has to ensure it covers a potentially larger
set of writes - those for which a reply has been sent, as all those
MIGHT have been received by the client.

>transmission of the NBD_CMD_FLUSH message MUST be written to

No, that's wrong, because the server has no idea when the client
transmitted the NBD_CMD_FLUSH message. It can only deal with
events it knows about. And the delimiter is (in essence) those
write commands that were replied to prior to the sending of the
reply to the NBD_CMD_FLUSH - of course it can flush others too.

>non-volatile storage prior to replying to that NBD_CMD_FLUSH. A
>server MAY process this command in ways that result in committing more
>data to non-volatile storage than is strictly required.

I think the wording is basically right for the current semantic,
but here's a slight improvement:

   The server MUST NOT send a non-error reply to NBD_CMD_FLUSH
   until it has ensured that the contents of all writes to which it
   has already completed (i.e. replied to) have been persisted
   to non-volatile storage.

However, given that the replies can subsequently be reordered, I
now think we do have a problem, as the client can't tell to which
those refer.
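
To be clear about what the current semantic asks of a server, here is a
minimal sketch (this is not the reference server's code; send_reply() is an
assumed helper, not a real function in it):

#include <errno.h>
#include <stdint.h>
#include <unistd.h>

int send_reply(uint64_t handle, int error);   /* assumed helper */

/* Do not send a non-error reply to NBD_CMD_FLUSH until everything this
 * server has already completed (i.e. replied to) is on non-volatile
 * storage. fsync() covers a superset of that: every write completed on
 * this file descriptor. */
static int handle_flush(int export_fd, uint64_t handle)
{
    if (fsync(export_fd) < 0)
        return send_reply(handle, errno);
    return send_reply(handle, 0);
}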

> [...]
>> I don't think there is actually a problem here - Wouter if I'm wrong
>> about this, I'd like to understand your argument better.
> 
> No, I now see that there isn't, and I misunderstood things. However, I
> do think we should update the spec to clarify this.

Haha - I now think there is. You accidentally convinced me!

>> b) What I'm describing - which is the lack of synchronisation between
>> channels.
> [... long explanation snipped...]
> 
> Yes, and I acknowledge that. However, I think that should not be a
> blocker. It's fine to mark this feature as experimental; it will not
> ever be required to use multiple connections to connect to a server.
> 
> When this feature lands in nbd-client, I plan to ensure that the man
> page and -help output says something along the following lines:
> 
> use N connections to connect to the NBD server, improving performance
> at the cost of a possible loss of reliability.

So in essence we are relying on (userspace) nbd-client not to open
more connections if it's unsafe? I.e. we can sort out all the negotiation
of whether it's safe or unsafe within userspace and not bother Josef
about it? I suppose that's fine in that we can at least shorten
the CC: line, but I still think it would be helpful if the protocol
helped out the end user of the client and refused to negotiate multichannel
connections when they are unsafe. How is the end client meant to know
whether the back end is not on Linux, not on a block device, done
via a Ceph driver etc?

>> Now, in the reference server, NBD_CMD_FLUSH is implemented through an
>> fdatasync().
> 
> Actually, no, the reference server uses fsync() for reasons that I've
> forgotten (side note: you wrote it that way ;-)
> 
> [...]

I vaguely remember why - something to do with files expanding when
holes were written to. However, I don't think that makes much
difference to the question I asked, or at most s/fdatasync/fsync/g

-- 
Alex Bligh






Re: [Nbd] [PATCH][V3] nbd: add multi-connection support

2016-10-03 Thread Alex Bligh
Josef,

> On 3 Oct 2016, at 15:32, Josef Bacik <jba...@fb.com> wrote:
> 
> Ok I understand your objections now.  You aren't arguing that we are unsafe 
> by default, only that we are unsafe with servers that do something special 
> beyond simply writing to a single disk or file.  I agree this is problematic, 
> but you simply don't use this feature if your server can't deal with it well.

Not quite. I am arguing that:

1. Parallel channels as currently implemented are inherently unsafe on some 
servers - I have given an example of such servers

2. Parallel channels as currently implemented may be unsafe on the reference 
server on some or all platforms (depending on the behaviour of fdatasync() 
which might vary between platforms)

3. Either the parallel channels stuff should issue flush requests on all 
channels, or the protocol should protect the unwitting user by negotiating an 
option to do so (or refusing multi-channel connects). It is not reasonable for 
an nbd client to have to know the intimate details of the server and its 
implementation of synchronisation primitives - saying 'the user should disable 
multiple channels' is not good enough, as this is what we have a protocol for.

"unsafe" here means 'do not conform to the kernel's requirements for flush 
semantics, whereas they did without multiple channels'

So from my point of view either we need to have an additional connection flag 
("don't use multiple channels unless you are going to issue a flush on all 
channels") OR flush on all channels should be the default.

Whichever way, some more work needs to be done on your patch, because it
currently doesn't issue flush on all channels, and it currently does not have
any way to prevent use of multiple channels.


I'd really like someone with Linux / POSIX knowledge to confirm the Linux and 
POSIX position on fdatasync of an fd opened by one process on writes made to 
the same file by another process. If on all POSIX platforms and all filing 
systems this is guaranteed to flush the second process's writes, then I think 
we could argue "OK there may be some weird servers which might not support 
multiple channels and they just need a way of signalling that". If on the other 
hand there is no such cross platform guarantee, I think this means in essence 
even with the reference server, this patch is unsafe, and it needs adapting to 
send flushes on all channels - yes it might theoretically be possible to 
introduce IPC to the current server, but you'd still need some way of tying 
together channels from a single client.

-- 
Alex Bligh






Re: [Nbd] [PATCH][V3] nbd: add multi-connection support

2016-10-03 Thread Alex Bligh
...NBD_CMD_TRIM) ***ASSOCIATED 
WITH ANY CLIENT AT ALL*** that the server completes (i.e. replies to) prior to 
processing a NBD_CMD_FLUSH MUST be written to non-volatile storage prior to 
replying to that NBD_CMD_FLUSH" - i.e. a flush on any channel of any client 
must flush every channel of every client, because we have no easy way to tell 
which clients are in fact two channels. I have concerns over the scalability of 
this.

Now, in the reference server, NBD_CMD_FLUSH is implemented through an 
fdatasync(). Each client (and therefore each channel) runs in a different 
process.

Earlier in this thread, someone suggested that if this happens:

  Process A                  Process B
  =========                  =========

  fd1=open("file123")
                             fd2=open("file123")

  write(fd1, ...)
                             fdatasync(fd2)

then the fdatasync() is guaranteed to sync the write that Process A has 
written. This may or may not be the case under Linux (wiser minds than me will 
know). Is it guaranteed to be the case with (e.g.) the file on NFS? On all 
POSIX platforms?
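
Purely to make the question concrete, this is the scenario above written
out as a small program (it is not a claim about what the answer is):

#include <fcntl.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    if (fork() == 0) {                      /* Process A */
        int fd1 = open("file123", O_WRONLY);
        write(fd1, "data", 4);
        close(fd1);
        _exit(0);
    }
    wait(NULL);                             /* A's write() has returned */

    int fd2 = open("file123", O_WRONLY);    /* Process B, separately opened fd */
    fdatasync(fd2);                         /* does this persist A's write? */
    close(fd2);
    return 0;
}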

Looking at 
http://pubs.opengroup.org/onlinepubs/009695399/functions/fdatasync.html I'd say 
it was a little ambiguous as to whether it will ALWAYS flush all data 
associated with the file even if it is being written by a different process 
(and a different FD).

If fdatasync() is always guaranteed to flush data associated with writes by a 
different process (and separately opened fd), then as it happens there won't be 
a problem on the reference server, just on servers that don't happen to use 
fdatasync() or similar to implement flushes, and which don't maintain their own 
caches. If fdatasync() is not so guaranteed, we have a problem with the 
reference server too, at least on some platforms and filing systems.

What I'm therefore asking for is either:
a) that the server can say 'if you are multichannel, you will need to send 
flush on each channel' (best); OR
b) that the server can say 'don't go multichannel'

as part of the negotiation stage. Of course as this is dependent on the 
backend, this is going to be something that is per-target (i.e. needs to come 
as a transmission flag or similar).

-- 
Alex Bligh






Re: [Nbd] [PATCH][V3] nbd: add multi-connection support

2016-10-02 Thread Alex Bligh

> On 29 Sep 2016, at 17:59, Josef Bacik <jba...@fb.com> wrote:
> 
> On 09/29/2016 12:41 PM, Wouter Verhelst wrote:
>> On Thu, Sep 29, 2016 at 10:03:50AM -0400, Josef Bacik wrote:
>>> So think of it like normal disks with multiple channels.  We don't send 
>>> flushes
>>> down all the hwq's to make sure they are clear, we leave that decision up 
>>> to the
>>> application (usually a FS of course).
>> 
>> Well, when I asked earlier, Christoph said[1] that blk-mq assumes that
>> when a FLUSH is sent over one channel, and the reply comes in, that all
>> commands which have been received, regardless of which channel they were
>> received over, have reached disk.
>> 
>> [1] Message-ID: <20160915122304.ga15...@infradead.org>
>> 
>> It is impossible for nbd to make such a guarantee, due to head-of-line
>> blocking on TCP.
>> 
> 
> Huh I missed that.  Yeah that's not possible for us for sure, I think my 
> option 
> idea is the less awful way forward if we want to address that limitation.  
> Thanks,

I think if the server supports flush (which you can tell), sending flush on
all channels is the only safe thing to do, without substantial protocol
changes (which I'm not sure how one would do given flush is in a sense
a synchronisation point). I think it's thus imperative this gets fixed
before the change gets merged.
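
From the client end, 'flush on all channels' is cheap to express. A sketch
only, with send_flush() and wait_flush_reply() standing in for whatever
per-connection I/O the patch actually uses:

int send_flush(int sock);         /* assumed helpers: issue NBD_CMD_FLUSH */
int wait_flush_reply(int sock);   /* ... and block for its reply */

/* Only report the flush complete once every channel has replied, so writes
 * completed on any channel are known to have reached non-volatile storage. */
static int flush_all_channels(const int *sock, int nchannels)
{
    int ret = 0;

    for (int i = 0; i < nchannels; i++)
        if (send_flush(sock[i]) < 0)
            ret = -1;

    for (int i = 0; i < nchannels; i++)
        if (wait_flush_reply(sock[i]) < 0)
            ret = -1;

    return ret;
}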

-- 
Alex Bligh






Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh
Wouter,

> On 15 Sep 2016, at 17:27, Wouter Verhelst <w...@uter.be> wrote:
> 
> On Thu, Sep 15, 2016 at 05:08:21PM +0100, Alex Bligh wrote:
>> Wouter,
>> 
>>> The server can always refuse to allow multiple connections.
>> 
>> Sure, but it would be neater to warn the client of that at negotiation
>> stage (it would only be one flag, e.g.  'multiple connections
>> unsafe').
> 
> I suppose that's not a bad idea.

Good.

> [...]
>>> I was thinking of changing the spec as follows:
>>> 
>>> diff --git a/doc/proto.md b/doc/proto.md
>>> index 217f57e..cb099e2 100644
>>> --- a/doc/proto.md
>>> +++ b/doc/proto.md
>>> @@ -308,6 +308,23 @@ specification, the
>>> [kernel 
>>> documentation](https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt)
>>> may be useful.
>>> 
>>> +For performance reasons, clients MAY open multiple connections to the
>>> +same server. To support such clients, servers SHOULD ensure that at
>>> +least one of the following conditions hold:
>>> +
>>> +* Flush commands are processed for ALL connections. That is, when an
>>> +  `NBD_CMD_WRITE` is processed on one connection, and then an
>>> +  `NBD_CMD_FLUSH` is processed on another connection, the data of the
>>> +  `NBD_CMD_WRITE` on the first connection MUST reach permanent storage
>>> +  before the reply of the `NBD_CMD_FLUSH` is sent.
>>> +* The server allows `NBD_CMD_WRITE` and `NBD_CMD_FLUSH` on at most one
>>> +  connection
>>> +* Multiple connections are not allowed
>>> +
>>> +In addition, clients using multiple connections SHOULD NOT send
>>> +`NBD_CMD_FLUSH` if an `NBD_CMD_WRITE` for which they care in relation to
>>> +the flush has not been replied to yet.
>>> +
>> 
>> I don't think that should be a mandatory behaviour.
> 
> Which part of it?

I've read it again :-) The wording was slightly contorted. I think
what I mean is that if you don't support flush at all, that's
another option.

The final paragraph I am not sure is right, as that's not what the kernel
currently does. If we are going to suggest a change in our main client's
behaviour, should we not just request that flush is done on all channels?

>> For one, it would
>> be reasonably easy on gonbdserver but a PITA on the reference server.
>> You'd need to put in IPC between each of the forked processes OR rely
>> on fdatasync() only - I'm not sure that would necessarily work
>> safely with (e.g.) the 'treefiles' / COW options.
>> 
>> I think better would be to say that the server MUST either
>> 
>> * Not support NBD_CMD_FLUSH at all
> 
> I think we should discourage not supporting FLUSH, rather than
> suggesting it. 

Sure, but some backends just don't support flush. For them, this
aspect at least is not a problem.

>> * Support NBD_CMD_FLUSH across channels (as you set out above), or
>> * Indicate that it does not support multiple channels.
> 
> You dropped the one with no writes. I said "at most" there for a reason.
> Originally I was going to say "if the server is read-only", but then
> thought that it could work to do the "at most" thing. After having given
> that some more thought, I now realize that if you write, you need to
> flush across to other channels, regardless of whether they write too, so
> that bit of it is moot now anyway.
> 
> Still, a server which exports read-only should still be safe for
> multiple connections, even if there is no cache coherency (since
> presumably nothing's going to change anyway).

Yes

>> Actually I think this is a problem anyway. A simpler failure case is
>> one where (by chance) one channel gets the writes, and one channel
>> gets the flushes. The flush reply is delayed beyond the replies to
>> unconnected writes (on the other channel) and hence the kernel thinks
>> replied-to writes have been persisted when they have not been.
> 
> Yes, that is another example of essentially the same problem.

Yeah, I was just trying to simplify things.

>> The only way to fix that (as far as I can see) without changing flush
>> semantics is for the block layer to issue flush requests on each
>> channel when flushing on one channel.
> 
> Christoph just said that that doesn't (currently) happen; I don't know
> whether the kernel currently already (optimistically) sends out flush
> requests before the writes that it expects to hit permanent storage have
> finished, but if it doesn't do that, then there is no problem and my
> suggested bit of spec would be okay.
> 
> If there are good reasons to do so,

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh
Wouter,

> The server can always refuse to allow multiple connections.

Sure, but it would be neater to warn the client of that
at negotiation stage (it would only be one flag, e.g.
'multiple connections unsafe'). That way the connection
won't fail with a cryptic EBUSY or whatever, but will
just negotiate a single connection.

> I was thinking of changing the spec as follows:
> 
> diff --git a/doc/proto.md b/doc/proto.md
> index 217f57e..cb099e2 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -308,6 +308,23 @@ specification, the
> [kernel 
> documentation](https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt)
> may be useful.
> 
> +For performance reasons, clients MAY open multiple connections to the
> +same server. To support such clients, servers SHOULD ensure that at
> +least one of the following conditions hold:
> +
> +* Flush commands are processed for ALL connections. That is, when an
> +  `NBD_CMD_WRITE` is processed on one connection, and then an
> +  `NBD_CMD_FLUSH` is processed on another connection, the data of the
> +  `NBD_CMD_WRITE` on the first connection MUST reach permanent storage
> +  before the reply of the `NBD_CMD_FLUSH` is sent.
> +* The server allows `NBD_CMD_WRITE` and `NBD_CMD_FLUSH` on at most one
> +  connection
> +* Multiple connections are not allowed
> +
> +In addition, clients using multiple connections SHOULD NOT send
> +`NBD_CMD_FLUSH` if an `NBD_CMD_WRITE` for which they care in relation to
> +the flush has not been replied to yet.
> +

I don't think that should be a mandatory behaviour. For one, it would
be reasonably easy on gonbdserver but a PITA on the reference server.
You'd need to put in IPC between each of the forked processes OR rely
on fdatasync() only - I'm not sure that would necessarily work
safely with (e.g.) the 'treefiles' / COW options.

I think better would be to say that the server MUST either

* Not support NBD_CMD_FLUSH at all
* Support NBD_CMD_FLUSH across channels (as you set out above), or
* Indicate that it does not support multiple channels.

>  Request message
> 
> The request message, sent by the client, looks as follows:
> 
> The latter bit (on the client side) is because even if your backend has
> no cache coherency issues, TCP does not guarantee ordering between
> multiple connections. I don't know if the above is in line with what
> blk-mq does, but consider the following scenario:
> 
> - A client sends two writes to the server, followed (immediately) by a
>  flush, where at least the second write and the flush are not sent over
>  the same connection.
> - The first write is a small one, and it is handled almost immediately.
> - The second write takes a little longer, so the flush is handled
>  earlier than the second write
> - The network packet containing the flush reply gets lost for whatever
>  reason, so the client doesn't get it, and we fall into TCP
>  retransmits.
> - The second write finishes, and its reply header does not get lost
> - After the second write reply reaches the client, the TCP retransmits
>  for the flush reply are handled.
> 
> In the above scenario, the flush reply arrives on the client side after
> a write reply which it did not cover; so the client will (incorrectly)
> assume that the write has reached permanent storage when in fact it may
> not have done so yet.
> 
> If the kernel does not care about the ordering of the two writes versus
> the flush, then there is no problem. I don't know how blk-mq works in
> that context, but if the above is a likely scenario, we may have to
> reconsider adding blk-mq to nbd.

Actually I think this is a problem anyway. A simpler failure case is
one where (by chance) one channel gets the writes, and one channel
gets the flushes. The flush reply is delayed beyond the replies to
unconnected writes (on the other channel) and hence the kernel thinks
replied-to writes have been persisted when they have not been.

The only way to fix that (as far as I can see) without changing flush
semantics is for the block layer to issue flush requests on each
channel when flushing on one channel. This, incidentally, would
resolve every other issue!

-- 
Alex Bligh






Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh
Eric,

> I doubt that qemu-nbd would ever want to support the situation with more
> than one client connection writing to the same image at the same time;
> the implications of sorting out data consistency between multiple
> writers is rather complex and not worth coding into qemu.  So I think
> qemu would probably prefer to just prohibit the multiple writer
> situation.

Yeah, I was thinking about a 'no multiple connection' flag.

>  And while multiple readers with no writer should be fine,
> I'm not even sure if multiple readers plus one writer can always be made
> to appear sane (if there is no coordination between the different
> connections, on an image where the writer changes AA to BA then flushes
> then changes to BB, it is still feasible that a reader could see AB
> (pre-flush state of the first sector, post-flush changes to the second
> sector, even though the writer never flushed that particular content to
> disk).

Agree

--
Alex Bligh






Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh
Paolo,

> On 15 Sep 2016, at 15:07, Paolo Bonzini <pbonz...@redhat.com> wrote:
> 
> I don't think QEMU forbids multiple clients to the single server, and
> guarantees consistency as long as there is no overlap between writes and
> reads.  These are the same guarantees you have for multiple commands on
> a single connection.
> 
> In other words, from the POV of QEMU there's no difference whether
> multiple commands come from one or more connections.

This isn't really about ordering, it's about cache coherency
and persisting things to disk.

What you say is correct as far as it goes in terms of ordering. However
consider the scenario with read and writes on two channels as follows
of the same block:

 Channel1        Channel2

 R                           Block read, and cached in user space in
                             channel 1's cache. Reply sent.

                 W           New value written; channel 2's cache updated,
                             channel 1's cache not.

 R                           Value returned from channel 1's cache.


In the above scenario, there is a problem if the server(s) handling the
two channels each use a read cache which is not coherent between the
two channels. An example would be a read-through cache on a server that
did fork() and shared no state between connections.

Similarly, if there is a write on channel 1 that has completed, and
the flush goes to channel 2, it may not (if state is not shared) guarantee
that the write on channel 1 (which has completed) is persisted to non-volatile
media. Obviously if the 'state' is OS block cache/buffers/whatever, it
will, but if it's (e.g.) a user-space per process write-through cache,
it won't.
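
A toy sketch of that failure mode, illustrative only (nothing here is
qemu-nbd's actual code): each forked connection keeps its own user-space
cache with no shared state, so a write through one connection leaves stale
data in the other connection's cache.

#include <stdint.h>
#include <string.h>

struct conn {
    uint8_t cache[4096];   /* per-connection read cache for one block */
    int     valid;
};

/* Write path: updates the backing store (omitted) and *this* connection's
 * cache only; the other connection's cache is never invalidated. */
static void conn_write(struct conn *c, const uint8_t *data)
{
    memcpy(c->cache, data, sizeof c->cache);
    c->valid = 1;
}

/* Read path: a later read on the other connection still returns its own,
 * now stale, cached copy. */
static const uint8_t *conn_read(const struct conn *c)
{
    return c->valid ? c->cache : NULL;   /* NULL: would go to backing store */
}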

I don't know whether qemu-nbd is likely to suffer from either of these.

--
Alex Bligh






Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh
Josef,

> On 15 Sep 2016, at 14:57, Josef Bacik <jba...@fb.com> wrote:
> 
> This isn't an NBD problem, this is an application problem.  The application 
> must wait for all writes it cares about _before_ issuing a flush.  This is 
> the same as for normal storage as it is for NBD.  It is not NBD's 
> responsibility to maintain coherency between multiple requests across 
> connections, just simply to act on and respond to requests.
> 
> I think changing the specification to indicate that this is the case for 
> multiple connections is a good thing, to keep NBD servers from doing weird 
> things like sending different connections to the same export to different 
> backing stores without some sort of synchronization.  It should definitely be 
> explicitly stated somewhere that NBD does not provide any ordering guarantees 
> and that is up to the application.  Thanks,

I don't think that's correct.

The block stack issues a flush to mean (precisely) "do not reply to this until 
all preceding writes THAT HAVE BEEN REPLIED TO have been persisted to 
non-volatile storage".

The danger is with multiple connections (where apparently only one flush is 
sent - let's say down connection 1) that not all the writes that have been 
replied to on connection 2 have been persisted to non-volatile storage. Only 
the ones on connection 1 have been persisted (this is assuming the nbd server 
doesn't 'link' in some way the connections).

There's nothing the 'application' (here meaning the kernel or higher level) can 
do to mitigate this. Sure, it can wait for all the replies, but this doesn't 
guarantee the writes have been persisted to non-volatile storage, precisely 
because a write may be replied to before its data reaches non-volatile storage.

-- 
Alex Bligh







Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh

> On 15 Sep 2016, at 13:41, Christoph Hellwig <h...@infradead.org> wrote:
> 
> On Thu, Sep 15, 2016 at 01:39:11PM +0100, Alex Bligh wrote:
>> That's probably right in the case of file-based back ends that
>> are running on a Linux OS. But gonbdserver for instance supports
>> (e.g.) Ceph based backends, where each connection might be talking
>> to a completely separate ceph node, and there may be no cache
>> consistency between connections.
> 
> Yes, if you don't have a cache coherent backend you are generally
> screwed with a multiqueue protocol.

I wonder if the ability to support multiqueue should be visible
in the negotiation stage. That would allow the client to refuse
to select multiqueue where it isn't safe.
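
Purely as a sketch of what I have in mind (the flag name, bit position and
helper below are hypothetical, not part of the current protocol):

#include <stdint.h>

/* Hypothetical transmission flag: set by a server only when its backend is
 * cache coherent across connections, so multiple channels are safe.
 * The bit value is purely illustrative. */
#define NBD_FLAG_CAN_MULTI_CONN  (1u << 8)

/* Client-side policy: fall back to a single connection unless the server
 * advertises the flag. */
static unsigned choose_connections(uint16_t transmission_flags,
                                   unsigned requested)
{
    return (transmission_flags & NBD_FLAG_CAN_MULTI_CONN) ? requested : 1;
}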

Wouter?

-- 
Alex Bligh







Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh

> On 15 Sep 2016, at 13:36, Christoph Hellwig <h...@infradead.org> wrote:
> 
> On Thu, Sep 15, 2016 at 01:33:20PM +0100, Alex Bligh wrote:
>> At an implementation level that is going to be a little difficult
>> for some NBD servers, e.g. ones that fork() a different process per
>> connection. There is in general no IPC to speak of between server
>> instances. Such servers would thus be unsafe with more than one
>> connection if FLUSH is in use.
>> 
>> I believe such servers include the reference server where there is
>> process per connection (albeit possibly with several threads).
>> 
>> Even single process servers (including mine - gonbdserver) would
>> require logic to pair up multiple connections to the same
>> device.
> 
> Why?  If you only send the completion after your I/O syscall returned
> you are fine if fsync comes from a different process, no matter
> if you're using direct or buffered I/O underneath.

That's probably right in the case of file-based back ends that
are running on a Linux OS. But gonbdserver for instance supports
(e.g.) Ceph based backends, where each connection might be talking
to a completely separate ceph node, and there may be no cache
consistency between connections.

-- 
Alex Bligh







Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh

> On 15 Sep 2016, at 13:23, Christoph Hellwig <h...@infradead.org> wrote:
> 
> On Thu, Sep 15, 2016 at 02:21:20PM +0200, Wouter Verhelst wrote:
>> Right. So do I understand you correctly that blk-mq currently doesn't
>> look at multiple queues, and just assumes that if a FLUSH is sent over
>> any one of the queues, it applies to all queues?
> 
> Yes.  The same is true at the protocol level for NVMe or SCSI transports
> that can make use of multiple queues.

At an implementation level that is going to be a little difficult
for some NBD servers, e.g. ones that fork() a different process per
connection. There is in general no IPC to speak of between server
instances. Such servers would thus be unsafe with more than one
connection if FLUSH is in use.

I believe such servers include the reference server where there is
process per connection (albeit possibly with several threads).

Even single process servers (including mine - gonbdserver) would
require logic to pair up multiple connections to the same
device.

I suspect ALL nbd servers would require a change. So I think we should
think carefully before introducing this protocol change.

It might be easier to make the client code change work around this
issue (somehow).
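
To illustrate the sort of per-device pairing logic I mean, a sketch only
(the structure and names are invented, not from any existing server):
connections naming the same export share one backing fd, so an fdatasync()
triggered by a flush on any of them covers writes replied to on all of them.

#include <stddef.h>
#include <string.h>

#define MAX_EXPORTS 16                        /* illustrative limit */

struct export_state {
    char name[256];
    int  backing_fd;
    int  refcount;
};

static struct export_state exports[MAX_EXPORTS];

/* Called when a new connection negotiates an export: reuse the shared
 * state if another connection already has the same device open. */
static struct export_state *pair_connection(const char *name, int new_fd)
{
    struct export_state *slot = NULL;

    for (size_t i = 0; i < MAX_EXPORTS; i++) {
        if (exports[i].refcount == 0) {
            if (!slot)
                slot = &exports[i];
        } else if (strcmp(exports[i].name, name) == 0) {
            exports[i].refcount++;            /* pair up with the existing fd */
            return &exports[i];
        }
    }
    if (!slot)
        return NULL;                          /* too many exports open */
    strncpy(slot->name, name, sizeof slot->name - 1);
    slot->backing_fd = new_fd;
    slot->refcount = 1;
    return slot;
}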

-- 
Alex Bligh







Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh

> On 15 Sep 2016, at 13:18, Christoph Hellwig <h...@infradead.org> wrote:
> 
> Yes, please do that.  A "barrier" implies draining of the queue.

Done

-- 
Alex Bligh







Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh
Christoph,

> It's not a write barrier - a write barrier was command that ensured that
> 
> a) all previous writes were completed to the host/client
> b) all previous writes were on non-volatile storage
> 
> and
> 
> c) the actual write with the barrier bit was on non-volatile storage

Ah! The bit you are complaining about is not the bit I pointed you to, but:

> NBD_CMD_FLUSH (3)
> 
> A flush request; a write barrier. 

I can see that's potentially confusing, as it isn't meant to mean 'an old-style
Linux kernel block device write barrier'. I think in general terms it
probably is some form of barrier, but I see no problem in deleting the
words "a write barrier" from the spec text if only to make it
clearer. However, I think the description of the command itself:

> The server MUST NOT send a successful reply header for this request before 
> all write requests for which a reply has already been sent to the client have 
> reached permanent storage (using fsync() or similar).

and the ordering section I pointed you to before, were both correct, yes?


>> The point still remains that "X was sent before Y" is difficult to
>> determine on the client side if X was sent over a different TCP channel
>> than Y, because a packet might be dropped (requiring a retransmit) for
>> X, and similar things. If blk-mq can deal with that, we're good and
>> nothing further needs to be done. If not, this should be evaluated by
>> someone more familiar with the internals of the kernel block subsystem
>> than me.
> 
> The important bit in all the existing protocols, and which Linux relies
> on is that any write the Linux block layer got a completion for earlier
> needs to be flushed out to non-volatile storage when a FLUSH command is
> set.  Anything that still is in flight does not matter.  Which for
> NBD means anything that you already replies to need to be flushed.

... that's what it says (I hope).

> Or to say it more practically - in the nbd server you simply need to
> call fdatasync on the backing device or file whenever you get a FLUSH
> request, and it will do the right thing.

Actually, fdatasync() technically does more than is necessary, as it
will also flush commands that have been processed but for which no
reply has yet been sent - that's no bad thing.
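
For what it's worth, that handling amounts to something like the sketch
below. A single-process server with one backing fd per export is assumed;
the request layout, command values and send_reply() are illustrative, not
taken from the reference server or gonbdserver.

#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Illustrative request shape and reply helper only. */
struct nbd_req { uint16_t flags, type; uint64_t offset; uint32_t len; };
enum { NBD_CMD_WRITE = 1, NBD_CMD_FLUSH = 3, NBD_CMD_FLAG_FUA = 1 << 0 };

static int send_reply(const struct nbd_req *req, int err)
{
    (void)req;            /* a real server writes the reply header here */
    return err;
}

/* Reply to a write only after the syscall has returned; fdatasync() before
 * replying to NBD_CMD_FLUSH, so every write already replied to is on stable
 * storage by the time the flush reply goes out. */
static int handle_request(int backing_fd, const struct nbd_req *req,
                          const void *payload)
{
    switch (req->type) {
    case NBD_CMD_WRITE:
        if (pwrite(backing_fd, payload, req->len, req->offset) != (ssize_t)req->len)
            return send_reply(req, EIO);
        if ((req->flags & NBD_CMD_FLAG_FUA) && fdatasync(backing_fd) < 0)
            return send_reply(req, EIO);
        return send_reply(req, 0);
    case NBD_CMD_FLUSH:
        if (fdatasync(backing_fd) < 0)
            return send_reply(req, EIO);
        return send_reply(req, 0);
    default:
        return send_reply(req, EINVAL);
    }
}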

-- 
Alex Bligh







Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh

> On 15 Sep 2016, at 12:52, Christoph Hellwig <h...@infradead.org> wrote:
> 
> On Thu, Sep 15, 2016 at 12:46:07PM +0100, Alex Bligh wrote:
>> Essentially NBD does supports FLUSH/FUA like this:
>> 
>> https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt
>> 
>> IE supports the same FLUSH/FUA primitives as other block drivers (AIUI).
>> 
>> Link to protocol (per last email) here:
>> 
>> https://github.com/yoe/nbd/blob/master/doc/proto.md#ordering-of-messages-and-writes
> 
> Flush as defined by the Linux block layer (and supported that way in
> SCSI, ATA, NVMe) only requires to flush all already completed writes
> to non-volatile media.  It does not impose any ordering unlike the
> nbd spec.

As maintainer of the NBD spec, I'm confused as to why you think it
imposes any ordering - if you think this, clearly I need to clean up
the wording.

Here's what it says:

> The server MAY process commands out of order, and MAY reply out of order,
> except that:
> 
>   • All write commands (that includes NBD_CMD_WRITE, and NBD_CMD_TRIM)
> that the server completes (i.e. replies to) prior to processing to a
> NBD_CMD_FLUSH MUST be written to non-volatile storage prior to replying to 
> that
> NBD_CMD_FLUSH. This paragraph only applies if NBD_FLAG_SEND_FLUSH is set 
> within
> the transmission flags, as otherwise NBD_CMD_FLUSH will never be sent by the
> client to the server.

(and another bit re FUA that isn't relevant here).

Here's the Linux Kernel documentation:

> The REQ_PREFLUSH flag can be OR ed into the r/w flags of a bio submitted from
> the filesystem and will make sure the volatile cache of the storage device
> has been flushed before the actual I/O operation is started.  This explicitly
> guarantees that previously completed write requests are on non-volatile
> storage before the flagged bio starts. In addition the REQ_PREFLUSH flag can 
> be
> set on an otherwise empty bio structure, which causes only an explicit cache
> flush without any dependent I/O.  It is recommend to use
> the blkdev_issue_flush() helper for a pure cache flush.


I believe that NBD treats NBD_CMD_FLUSH the same as a REQ_PREFLUSH and empty
bio.

If you don't read those two as compatible, I'd like to understand why not
(i.e. what additional constraints one is applying that the other is not)
as they are meant to be the same (save that NBD only has FLUSH as a command,
i.e. the 'empty bio' version). I am happy to improve the docs to make it
clearer.

(sidenote: I am interested in the change from REQ_FLUSH to REQ_PREFLUSH,
but in an empty bio it's not really relevant I think).

> FUA as defined by the Linux block layer (and supported that way in SCSI,
> ATA, NVMe) only requires the write operation the FUA bit is set on to be
> on non-volatile media before completing the write operation.  It does
> not impose any ordering, which seems to match the nbd spec.  Unlike the
> NBD spec Linux does not allow FUA to be set on anything by WRITE
> commands.  Some other storage protocols allow a FUA bit on READ
> commands or other commands that write data to the device, though.

I think you mean "anything *but* WRITE commands". In NBD setting
FUA on a command that does not write will do nothing, but FUA can
be set on NBD_CMD_TRIM and has the expected effect.

Interestingly the kernel docs are silent on which commands REQ_FUA
can be set on.

-- 
Alex Bligh







Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh

> On 15 Sep 2016, at 12:46, Christoph Hellwig <h...@infradead.org> wrote:
> 
> On Thu, Sep 15, 2016 at 12:43:35PM +0100, Alex Bligh wrote:
>> Sure, it's at:
>> 
>> https://github.com/yoe/nbd/blob/master/doc/proto.md#ordering-of-messages-and-writes
>> 
>> and that link takes you to the specific section.
>> 
>> The treatment of FLUSH and FUA is meant to mirror exactly the
>> linux block layer (or rather how the linux block layer was a few
>> years ago). I even asked on LKML to verify a few points.
> 
> Linux never expected ordering on the wire.  Before 2010 we had barriers
> in the kernel that provided ordering to the caller, but we never
> required it from the protocol / hardware.

Sure. And I think the doc section reflects exactly Linux's post-2010
expectations (note the link to the kernel documentation). IE
servers are not required to order reads/writes (save that all
writes that the server completes prior to replying to an NBD_CMD_FLUSH
must be persisted prior to the reply to that NBD_CMD_FLUSH,
which I believe to be exactly the same behaviour as the kernel
dealing with an empty bio with REQ_FLUSH set).

Perhaps the section should be called "No ordering of messages
and writes"!

My point was that *in practice* out-of-order processing is not well tested,
as *in practice* many server implementations do in fact process
each command in order, though that's changed recently.

-- 
Alex Bligh







Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh

> On 15 Sep 2016, at 12:40, Christoph Hellwig <h...@infradead.org> wrote:
> 
> On Thu, Sep 15, 2016 at 01:29:36PM +0200, Wouter Verhelst wrote:
>> Yes, and that is why I was asking about this. If the write barriers
>> are expected to be shared across connections, we have a problem. If,
>> however, they are not, then it doesn't matter that the commands may be
>> processed out of order.
> 
> There is no such thing as a write barrier in the Linux kernel.  We'd
> much prefer protocols not to introduce any pointless synchronization
> if we can avoid it.

I suspect the issue is terminological.

Essentially NBD supports FLUSH/FUA like this:

https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt

IE it supports the same FLUSH/FUA primitives as other block drivers (AIUI).

Link to protocol (per last email) here:

https://github.com/yoe/nbd/blob/master/doc/proto.md#ordering-of-messages-and-writes

-- 
Alex Bligh







Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh
Christoph,

> On 15 Sep 2016, at 12:38, Christoph Hellwig <h...@infradead.org> wrote:
> 
> On Thu, Sep 15, 2016 at 12:49:35PM +0200, Wouter Verhelst wrote:
>> A while back, we spent quite some time defining the semantics of the
>> various commands in the face of the NBD_CMD_FLUSH and NBD_CMD_FLAG_FUA
>> write barriers. At the time, we decided that it would be unreasonable
>> to expect servers to make these write barriers effective across
>> different connections.
> 
> Do you have a nbd protocol specification?  treating a flush or fua
> as any sort of barrier is incredibly stupid.  Is it really documented
> that way, and if yes, why?

Sure, it's at:

https://github.com/yoe/nbd/blob/master/doc/proto.md#ordering-of-messages-and-writes

and that link takes you to the specific section.

The treatment of FLUSH and FUA is meant to mirror exactly the
linux block layer (or rather how the linux block layer was a few
years ago). I even asked on LKML to verify a few points.

-- 
Alex Bligh







Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh
Wouter, Josef, (& Eric)

> On 15 Sep 2016, at 11:49, Wouter Verhelst <w...@uter.be> wrote:
> 
> Hi,
> 
> On Fri, Sep 09, 2016 at 10:02:03PM +0200, Wouter Verhelst wrote:
>> I see some practical problems with this:
> [...]
> 
> One more that I didn't think about earlier:
> 
> A while back, we spent quite some time defining the semantics of the
> various commands in the face of the NBD_CMD_FLUSH and NBD_CMD_FLAG_FUA
> write barriers. At the time, we decided that it would be unreasonable
> to expect servers to make these write barriers effective across
> different connections.

Actually I wonder if there is a wider problem in that implementations
might mediate access to a device by the presence of an extant TCP connection,
i.e. only permit one TCP connection to access a given block device at
once. If you think about (for instance) a forking daemon that does
writeback caching, that would be an entirely reasonable thing to do
for data consistency.

I also wonder whether any servers that can do caching per
connection will always share a consistent cache between 
connections. The one I'm worried about in particular here
is qemu-nbd - Eric Blake CC'd.

A more general point is that with multiple queues requests
may be processed in a different order even by those servers that
currently process the requests in strict order, or in something
similar to strict order. The server is permitted by the spec
(save as mandated by NBD_CMD_FLUSH and NBD_CMD_FLAG_FUA) to
process commands out of order anyway, but I suspect this has
to date been little tested.

Lastly, I confess to a lack of familiarity with the kernel-side
code, but how is NBD_CMD_DISCONNECT synchronised across
each of the connections? Presumably you need to send it on
each channel, but cannot assume the NBD connection as a whole
is dead until the last TCP connection has closed?

-- 
Alex Bligh







Re: [Nbd] [PATCH 2/2] nbd: Disallow ioctls on disconnected block device

2016-07-16 Thread Alex Bligh

On 16 Jul 2016, at 08:42, Pranay Srivastava <pran...@gmail.com> wrote:

> So instead can't we put a mechanism in place for network address + mac
> to be same
> for allowing clients to reconnect? Do let me know if this is not of concern.

MAC address?! nbd clients connect over IP, and if a router reboots
between them, you could easily see two packets from the same client
come from different MAC addresses. Similarly, all clients not on
the same L2 network will appear with the same MAC address (that of
the last-hop router). So MAC address is a very poor indicator of
'same client'.

IP address is also a poor indicator (think NAT) but is substantially
less bad.

-- 
Alex Bligh







Block layer - meaning of REQ_FUA on not write requests

2016-04-01 Thread Alex Bligh
I am trying to clean up the documentation of the NBD protocol. NBD's
support for Force Unit Access (FUA) was modelled on the linux kernel
block layer. When I added support a few years ago, I omitted to find
out exactly what types of request it applies to. Obviously it applies
to write requests, but how about others (e.g. read)?

https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt

seems somewhat ambiguous. It says

> The REQ_FUA flag can be OR ed into the r/w flags of a bio submitted from the
> filesystem and will make sure that I/O completion for this request is only
> signaled after the data has been committed to non-volatile storage.

and

> For devices that also support the FUA bit the block layer needs
> to be told to pass through the REQ_FUA bit using:
> 
>   blk_queue_flush(sdkp->disk->queue, REQ_FLUSH | REQ_FUA);
> 
> and the driver must handle write requests that have the REQ_FUA bit set
> in prep_fn/request_fn.  If the FUA bit is not natively supported the block
> layer turns it into an empty REQ_FLUSH request after the actual write.

I can see four possible interpretations for reads:

1. REQ_FUA is not permitted on requests other than write, and it is an
   error to include it.

2. REQ_FUA does nothing on requests other than writes, and whilst permitted
   is ignored.

3. REQ_FUA has the following meaning on read: that area of data referred
   to by the read must be written to persistent storage if it is dirty.

and the following, which is not an interpretation of the above but is a
logical possibility:

4. REQ_FUA means that any in-memory copy of the data to be read must be
   discarded and the data reread from persistent storage (in case it
   has been written to by another client, for instance).

What meaning does REQ_FUA have for reads? Similarly what meaning does
it have for other block layer requests (e.g. trim)?


-- 
Alex Bligh



Re: kernel panic in skb_copy_bits

2013-07-04 Thread Alex Bligh



--On 4 July 2013 03:12:10 -0700 Eric Dumazet  wrote:


It looks like a typical COW issue to me.

If the page content is written while there is still a reference on this
page, we should allocate a new page and copy the previous content.

And this has little to do with networking.


I suspect this would get more attention if we could make Ian's case
below trigger (a) outside Xen, (b) outside networking.


memset(buf, 0xaa, 4096);
write(fd, buf, 4096)
memset(buf, 0x55, 4096);
(where fd is O_DIRECT on NFS) Can result in 0x55 being seen on the wire
in the TCP retransmit.


We know this should fail using O_DIRECT+NFS. We've had reports suggesting
it fails in O_DIRECT+iSCSI. However, that's been with a kernel panic
(under Xen) rather than data corruption as per the above.

Historical trawling suggests this is an issue with DRBD (see Ian's
original thread from the mists of time).

I don't quite understand why we aren't seeing corruption with standard
ATA devices + O_DIRECT and no Xen involved at all.

My memory is a bit misty on this but I had thought the reason why
this would NOT be solved simply by O_DIRECT taking a reference to
the page was that the O_DIRECT I/O completed (and thus the reference
would be freed up) before the networking stack had actually finished
with the page. If the O_DIRECT I/O did not complete until the
page was actually finished with, we wouldn't see the problem in the
first place. I may be completely off base here.
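
For anyone wanting to try to reproduce Ian's case outside Xen, a minimal
sketch of the pattern (the path and the 4096-byte alignment are assumptions
on my part; whether it actually triggers outside NFS is exactly the open
question above):

#define _GNU_SOURCE                        /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    void *buf;
    int fd;

    if (posix_memalign(&buf, 4096, 4096))  /* O_DIRECT wants aligned buffers */
        return 1;
    fd = open("/mnt/nfs/testfile", O_WRONLY | O_CREAT | O_DIRECT, 0600);
    if (fd < 0)
        return 1;

    memset(buf, 0xaa, 4096);
    if (write(fd, buf, 4096) != 4096)      /* completes before TCP is done with the page */
        return 1;
    memset(buf, 0x55, 4096);               /* what a retransmit may then put on the wire */

    close(fd);
    free(buf);
    return 0;
}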

--
Alex Bligh




Re: kernel panic in skb_copy_bits

2013-07-01 Thread Alex Bligh

Joe,


Do you know if have a fix for above? so far we also suspected the
grant page be unmapped earlier, we using 4.1 stable during our test.


A true fix? No, but I posted a patch set (see later email message
for a link) that you could forward port. The workaround is:


A workaround is to turn off O_DIRECT use by Xen as that ensures
the pages are copied. Xen 4.3 does this by default.

I believe fixes for this are in 4.3 and 4.2.2 if using the
qemu upstream DM. Note these aren't real fixes, just a workaround
of a kernel bug.


The guest is pvm, and disk model is xvbd, guest config file as below:

...

I think this only for pvhvm/hvm?


I don't have much experience outside pvhvm/hvm, but I believe it
should work for any device.

Testing was simple - just find all (*) the references to O_DIRECT
in your device model and remove them!

(*)=you could be less lazy than me and find the right ones.

I am guessing it will be the same ones though.

--
Alex Bligh




Re: kernel panic in skb_copy_bits

2013-06-30 Thread Alex Bligh



--On 30 June 2013 10:13:35 +0100 Alex Bligh  wrote:


The nature of the bug
is extensively discussed in that thread - you'll also find
a reference to a thread on linux-nfs which concludes it
isn't an nfs problem, and even some patches to fix it in the
kernel adding reference counting.


Some more links for anyone interested in fixing the kernel bug:

http://lists.xen.org/archives/html/xen-devel/2013-01/msg01618.html
http://www.spinics.net/lists/linux-nfs/msg34913.html
http://www.spinics.net/lists/netdev/msg224106.html

--
Alex Bligh


Re: kernel panic in skb_copy_bits

2013-06-30 Thread Alex Bligh



--On 28 June 2013 12:17:43 +0800 Joe Jin  wrote:


Find a similar issue
http://www.gossamer-threads.com/lists/xen/devel/265611 So copied to Xen
developer as well.


I thought this sounded familiar. I haven't got the start of this
thread, but what version of Xen are you running and what device
model? If before 4.3, there is a page lifetime bug in the kernel
(not the xen code) which can affect anything where the guest accesses
the host's block stack and that in turn accesses the networking
stack (it may in fact be wider than that). So, e.g., domU on
iSCSI will do it. It tends to get triggered by a TCP retransmit
or (on NFS) the RPC equivalent. Essentially the block operation
is considered complete, returning through Xen and freeing the
grant table entry, and yet something in the kernel (e.g. a TCP
retransmit) can still access the data. The nature of the bug
is extensively discussed in that thread - you'll also find
a reference to a thread on linux-nfs which concludes it
isn't an nfs problem, and even some patches to fix it in the
kernel adding reference counting.

A workaround is to turn off O_DIRECT use by Xen as that ensures
the pages are copied. Xen 4.3 does this by default.

I believe fixes for this are in 4.3 and 4.2.2 if using the
qemu upstream DM. Note these aren't real fixes, just a workaround
of a kernel bug.

To fix on a local build of xen you will need something like this:
https://github.com/abligh/qemu-upstream-4.2-testing/commit/9a97c011e1a682eed9bc7195a25349eaf23ff3f9
and something like this (NB: obviously insert your own git
repo and commit numbers)
https://github.com/abligh/xen/commit/f5c344afac96ced8b980b9659fb3e81c4a0db5ca

Also note those fixes are (technically) unsafe for live migration
unless there is an ordering change made in qemu's block open
call.

Of course this might be something completely different.

--
Alex Bligh




Re: [PATCH 1/3] nbd: support FLUSH requests

2013-02-13 Thread Alex Bligh

On 13 Feb 2013, at 16:02, Paolo Bonzini wrote:

> If you do not have REQ_FUA, as is the case with this patch, the block
> layer converts it to a REQ_FLUSH *after* the write.

OK. Well, +/- it converting an fdatasync() into a full fsync(), that
should be OK.

-- 
Alex Bligh






Re: [PATCH 1/3] nbd: support FLUSH requests

2013-02-13 Thread Alex Bligh
Paolo,

On 13 Feb 2013, at 13:00, Paolo Bonzini wrote:

> But as far as I can test with free servers, the FUA bits have no
> advantage over flush.  Also, I wasn't sure if SEND_FUA without
> SEND_FLUSH is valid, and if so how to handle this combination (treat it
> as writethrough and add FUA to all requests? warn and do nothing?).

On the main opensource nbd client, the following applies:

What REQ_FUA does is an fdatasync() after the write. Code extract and
comments below from Christoph Hellwig.

What REQ_FLUSH does is to do an fsync().

The way I read Christoph's comment, provided the Linux block layer always
issues a REQ_FLUSH before a REQ_FUA, there is no performance problem.

However, a REQ_FUA is going to do an f(data)?sync AFTER the write, whereas
the preceding REQ_FLUSH is going to do an fsync() BEFORE the write. It seems
to me that either the FUA and FLUSH semantics are therefore different
(and we need FUA), or that Christoph's comment is wrong and that you
are guaranteed a REQ_FLUSH *after* the write with REQ_FUA.

-- 
Alex Bligh




} else if (fua) {

  /* This is where we would do the following
   *   #ifdef USE_SYNC_FILE_RANGE
   * However, we don't, for the reasons set out below
   * by Christoph Hellwig
   *
   * [BEGINS]
   * fdatasync is equivalent to fsync except that it does not flush
   * non-essential metadata (basically just timestamps in practice), but it
   * does flush metadata required to find the data again, e.g. allocation
   * information and extent maps.  sync_file_range does nothing but flush
   * out pagecache content - it means you basically won't get your data
   * back in case of a crash if you either:
   *
   *  a) have a volatile write cache in your disk (e.g. any normal SATA disk)
   *  b) are using a sparse file on a filesystem
   *  c) are using a fallocate-preallocated file on a filesystem
   *  d) use any file on a COW filesystem like btrfs
   *
   * e.g. it only does anything useful for you if you do not have a volatile
   * write cache, and either use a raw block device node, or just overwrite
   * an already fully allocated (and not preallocated) file on a non-COW
   * filesystem.
   * [ENDS]
   *
   * What we should do is open a second FD with O_DSYNC set, then write to
   * that when appropriate. However, with a Linux client, every REQ_FUA
   * immediately follows a REQ_FLUSH, so fdatasync does not cause performance
   * problems.
   */
#if 0
        sync_file_range(fhandle, foffset, len,
                        SYNC_FILE_RANGE_WAIT_BEFORE |
                        SYNC_FILE_RANGE_WRITE |
                        SYNC_FILE_RANGE_WAIT_AFTER);
#else
        fdatasync(fhandle);
#endif
}




Re: [PATCH 1/3] nbd: support FLUSH requests

2013-02-13 Thread Alex Bligh
Paolo,

On 13 Feb 2013, at 13:00, Paolo Bonzini wrote:

 But as far as I can test with free servers, the FUA bits have no
 advantage over flush.  Also, I wasn't sure if SEND_FUA without
 SEND_FLUSH is valid, and if so how to handle this combination (treat it
 as writethrough and add FUA to all requests? warn and do nothing?).

On the main opensource nbd client, the following applies:

What REQ_FUA does is an fdatasync() after the write. Code extract and
comments below from Christoph Hellwig.

What REQ_FLUSH does is to do an fsync().

The way I read Christoph's comment, provided the linux block layer always
issues a REQ_FLUSH before a REQ_FUA, there is not performance problem.

However, a REQ_FUA is going to do a f(data)?sync AFTER the write, whereas
the preceding REQ_FLUSH is going to an fsync() BEFORE the write. It seems
to me that either the FUA and FLUSH semantics are therefore different
(and we need FUA), or that Christoph's comment is wrong and that you
are guaranteed a REQ_FLUSH *after* the write with REQ_FUA.

-- 
Alex Bligh




} else if (fua) {

  /* This is where we would do the following
   *   #ifdef USE_SYNC_FILE_RANGE
   * However, we don't, for the reasons set out below
   * by Christoph Hellwig h...@infradead.org
   *
   * [BEGINS] 
   * fdatasync is equivalent to fsync except that it does not flush
   * non-essential metadata (basically just timestamps in practice), 
but it
   * does flush metadata requried to find the data again, e.g. 
allocation
   * information and extent maps.  sync_file_range does nothing but 
flush
   * out pagecache content - it means you basically won't get your data
   * back in case of a crash if you either:
   * 
   *  a) have a volatile write cache in your disk (e.g. any normal SATA 
disk)
   *  b) are using a sparse file on a filesystem
   *  c) are using a fallocate-preallocated file on a filesystem
   *  d) use any file on a COW filesystem like btrfs
   * 
   * e.g. it only does anything useful for you if you do not have a 
volatile
   * write cache, and either use a raw block device node, or just 
overwrite
   * an already fully allocated (and not preallocated) file on a non-COW
   * filesystem.
   * [ENDS]
   *
   * What we should do is open a second FD with O_DSYNC set, then write 
to
   * that when appropriate. However, with a Linux client, every REQ_FUA
   * immediately follows a REQ_FLUSH, so fdatasync does not cause 
performance
   * problems.
   *
   */
#if 0
sync_file_range(fhandle, foffset, len,
SYNC_FILE_RANGE_WAIT_BEFORE | 
SYNC_FILE_RANGE_WRITE |
SYNC_FILE_RANGE_WAIT_AFTER);
#else
fdatasync(fhandle);
#endif
}


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 1/3] nbd: support FLUSH requests

2013-02-13 Thread Alex Bligh

On 13 Feb 2013, at 16:02, Paolo Bonzini wrote:

> If you do not have REQ_FUA, as is the case with this patch, the block
> layer converts it to a REQ_FLUSH *after* the write.

OK. Well +/- it converting an fdatasync() into a full fsync(), that
should be ok.
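
To spell that out at the file level: with FLUSH advertised but no FUA, the
server-side effect is just "write, then sync the whole back end". A minimal
sketch, purely illustrative and not taken from any of the servers discussed:

#include <sys/types.h>
#include <unistd.h>

/* Illustrative only: approximating a FUA write when the server can
 * only offer FLUSH - an ordinary write followed by a full fsync(),
 * i.e. a slightly stronger guarantee than the fdatasync() that FUA
 * itself would need. */
static int write_then_flush(int fd, const void *buf, size_t len, off_t off)
{
        if (pwrite(fd, buf, len, off) != (ssize_t)len)
                return -1;
        return fsync(fd);
}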

-- 
Alex Bligh






Re: [PATCH 1/3] nbd: support FLUSH requests

2013-02-12 Thread Alex Bligh

On 12 Feb 2013, at 21:32, Andrew Morton wrote:

> 
> Obviously the changelog was inadequate.  Please send along a new one
> which fully describes the reasons for this change.

To be clear I have no complaints about the rest of the patch being
merged. Supporting FLUSH but not FUA is far better than supporting
neither. I just didn't understand dropping FUA given that the semantics
of nbd are in essence 'linux bios over tcp'.

-- 
Alex Bligh






Re: [PATCH 1/3] nbd: support FLUSH requests

2013-02-12 Thread Alex Bligh
Paolo,

On 12 Feb 2013, at 18:06, Paolo Bonzini wrote:

> On 12/02/2013 18:37, Alex Bligh wrote:
>> For my education, why remove the FUA stuff?
> 
> Because I had no way to test it.

I think my mods to the official NBD code support FUA (albeit not very 
efficiently)

>>>>>> Hmmm... the underlying storage could be md/dm RAIDs in which case FUA
>>>>>> should be cheaper than FLUSH.
>>>> 
>>>> If someone ever wrote a virtio-blk backend that sits directly ontop
>>>> of the Linux block layer that would be true.
>> In this case we don't know what the backend is sitting on top of
>> a-priori. It might be the current nbd server code, but it might
>> not be.
> 
> Do you know of any other NBD server than the "official" one and qemu-nbd?


Yes. I know one well (but it's not open source). NBD seems to be the
'go-to protocol' for writing distributed block store drivers in user space.

-- 
Alex Bligh






Re: [PATCH 1/3] nbd: support FLUSH requests

2013-02-12 Thread Alex Bligh
Paolo,

> Add support for a new flag that the server can pass.  If the flag is
> enabled, we translate REQ_FLUSH requests into the NBD_CMD_FLUSH
> command.
> 
> Cc: 
> Cc: Paul Clements 
> Cc: Andrew Morton 
> Signed-off-by: Alex Bligh 
> [ Removed FUA support for reasons similar to those outlined in
>  https://lkml.org/lkml/2010/8/17/234 for virtio - Paolo ]
> Signed-off-by: Paolo Bonzini 


For my education, why remove the FUA stuff? The link you posted
was for virtio drivers, and says:

> > Hmmm... the underlying storage could be md/dm RAIDs in which case FUA
> > should be cheaper than FLUSH.
> 
> If someone ever wrote a virtio-blk backend that sits directly ontop
> of the Linux block layer that would be true.

In this case we don't know what the backend is sitting on top of
a-priori. It might be the current nbd server code, but it might
not be.

-- 
Alex Bligh






Re: Local DoS through write heavy I/O on CFQ & Deadline

2012-10-13 Thread Alex Bligh



--On 13 October 2012 21:53:09 +0800 Hillf Danton wrote:


Take a look at the "wait for writeback" problem please.

Linux 3.0+ Disk performance problem - wrong pdflush behaviour
https://lkml.org/lkml/2012/10/10/412


I'm guessing that's related but may not be the whole story. My
test case is rather simpler, and Viktor says that with the
patch causing his regression reverted, "After I've set the dirty_bytes
over the file size the writes are never blocked." That suggests
to me that in order to avoid write blocking he needs dirty_bytes
larger than the file size. As the bytes written in my test case
exceed RAM, that's going to be an issue as dirty_bytes is always
going to be hit; I think in Viktor's case he is trying to avoid
it being hit at all.

Or perhaps I have the wrong end of the stick.

--
Alex Bligh


Re: Local DoS through write heavy I/O on CFQ & Deadline

2012-10-12 Thread Alex Bligh

Michal,

--On 12 October 2012 16:58:39 +0200 Michal Hocko wrote:


Once dirty_ratio (resp. dirty_bytes) limit is hit then the process which
writes gets throttled. If this is not the case then there is a bug in
the throttling code.


I believe that is the problem.


Isn't the only thing that is going to change that it ends up
triggering the writeback earlier?


Set the limit lowe?


I think you mean 'lower'. If I do that, what I think will happen
is that it will start the write-back earlier, but the writeback
once started will not keep up with the generation of data, possibly
because the throttling isn't going to work. Note that for
instance using ionice to set priority or class to 'idle'
has no effect. So, to test my hypothesis ...


Happy to test etc - what would you suggest, dirty_ratio=5,
dirty_background_ratio=2 ?


These are measured in percentage. On the other hand if you use
dirty_bytes resp. dirty_background_bytes then you get absolute numbers
independent on the amount of memory.


... what would you suggest I set any of these to in order to test
(assuming the same box) so that it's 'low enough' that if it still
hangs, it's a bug, rather than it's simply 'not low enough'. It's
an 8G box and clearly I'm happy to set either the _ratio or _bytes
entries.
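
For reference, applying absolute limits just means writing values into the
sysctl files mentioned above; a minimal sketch, with arbitrary example
values rather than a recommendation:

#include <stdio.h>

/* Illustrative only: apply absolute dirty limits by writing to the
 * sysctl files discussed above. 256 MiB / 64 MiB are arbitrary
 * example values, not a recommendation. */
int main(void)
{
        FILE *db  = fopen("/proc/sys/vm/dirty_bytes", "w");
        FILE *dbb = fopen("/proc/sys/vm/dirty_background_bytes", "w");

        if (!db || !dbb)
                return 1;
        fprintf(db,  "%d\n", 256 * 1024 * 1024);
        fprintf(dbb, "%d\n",  64 * 1024 * 1024);
        fclose(db);
        fclose(dbb);
        return 0;
}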

--
Alex Bligh


Re: Local DoS through write heavy I/O on CFQ & Deadline

2012-10-12 Thread Alex Bligh



--On 12 October 2012 15:30:45 +0200 Michal Hocko wrote:


Full info, including logs and scripts can be found at:
  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1064521


You seem to have 8G of RAM and dirty_ratio=20 resp.
dirty_background_ratio=10 which means that 1.5G worth of dirty data
until writer gets throttled which is a lot. Background writeback starts
at 800M which is probably not sufficient as well. Have you tried to set
dirty_bytes at a reasonable value (wrt. to your storage)?


This is for an appliance install where we have no idea how much
memory the box has in advance other than 'at least 4G' so it
is difficult to tune by default.

However, I don't think that would solve the problem as the zcat/dd
can always generate data faster than it can be written to disk unless
or until it is throttled, which it never is. Isn't the only thing that
is going to change that it ends up triggering the writeback earlier?

Happy to test etc - what would you suggest, dirty_ratio=5,
dirty_background_ratio=2 ?

--
Alex Bligh


Re: Local DoS through write heavy I/O on CFQ & Deadline

2012-10-12 Thread Alex Bligh

Alan,

--On 11 October 2012 14:46:34 +0100 Alan Cox wrote:



We have reproduced this on multiple hardware environments, using 3.2
(/proc/version_signature gives "Ubuntu 3.2.0-29.46-generic 3.2.24").
Anecdotally we believe the situation has worsened since 3.0.


I've certainly seen this on 3.0 and 3.2, but do you still see it on
3.5/6 ?


We've just tested this. We see exactly the same issue on 3.6.1 using
the current build of the Ubuntu Quantal kernel, which is:

Linux version 3.6.1-030601-generic (apw@gomeisa) (gcc version 4.6.3 
(Ubuntu/Linaro 4.6.3-1ubuntu5) ) #201210071322 SMP Sun Oct 7 17:23:28 UTC 
2012


More details (including full logs for that kernel) at:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1064521

--
Alex Bligh


Local DoS through write heavy I/O on CFQ & Deadline

2012-10-11 Thread Alex Bligh
We have noticed significant I/O scheduling issues on both the CFQ and the
deadline scheduler where a non-root user can starve any other process of
any I/O for minutes at a time. The problem is more serious using CFQ but is
still an effective local DoS vector using Deadline.

A simple way to generate the problem is:

  dd if=/dev/zero of=- bs=1M count=5 | dd if=- of=myfile bs=1M count=5

(note the use of 2 dd's is to avoid alleged optimisation of the writing dd
from /dev/zero). zcat-ing a large file with stdout redirected to a file
produces a similar effect. Using ionice to set idle priority makes no
difference.

To instrument the problem we produced a python script which does a MySQL
select and update every 10 seconds, and times the execution of the update.
This is normally milliseconds, but under user generated load conditions, we
can take this to an indefinite time (on CFQ) and to over a minute (on deadline).
Postgres is affected in a similar manner (i.e. it is not MySQL specific).
Simultaneously we have captured the output of 'vmstat 1 2' and
/proc/meminfo, with appropriate timestamps.
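
A much simpler stand-in for that probe, sketched here on the assumption that
timing a small synchronous write is enough to show the stalls (the real
instrumentation used MySQL/Postgres as described above; the path is arbitrary):

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Minimal stand-in for the probe described above: every 10 seconds,
 * time a one-byte synchronous write. Under the load described, this
 * jumps from milliseconds to minutes. */
int main(void)
{
        int fd = open("/var/tmp/latency-probe", O_CREAT | O_WRONLY, 0600);
        char byte = 0;

        if (fd < 0)
                return 1;
        for (;;) {
                struct timespec t0, t1;

                clock_gettime(CLOCK_MONOTONIC, &t0);
                if (pwrite(fd, &byte, 1, 0) != 1 || fsync(fd) != 0)
                        return 1;
                clock_gettime(CLOCK_MONOTONIC, &t1);
                printf("sync write took %.3f s\n",
                       (t1.tv_sec - t0.tv_sec) +
                       (t1.tv_nsec - t0.tv_nsec) / 1e9);
                fflush(stdout);
                sleep(10);
        }
}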

We have reproduced this on multiple hardware environments, using 3.2
(/proc/version_signature gives "Ubuntu 3.2.0-29.46-generic 3.2.24").
Anecdotally we believe the situation has worsened since 3.0.

We believe the problem is that dirty pages writeout is starving the system
of any other I/O, and whilst the process concerned should be penalised to
allow other processes I/O time, this is not happening.

Full info, including logs and scripts can be found at:
  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1064521

I believe this represents a local DoS vector as an unprivileged user can
effectively stall any root owned process that is performing I/O.

-- 
Alex Bligh






Re: Controversy over dynamic linking -- how to end the panic (long)

2001-06-21 Thread Alex Bligh - linux-kernel
rights of distribution of the GPL'd code if
   you distribute stuff yourself. This is set out clearly
   in the preamble.

Areas of doubt:

A. Are all contributed patches / files /always/ under a license
   no more restrictive than 'COPYING' - including Linus'
   preamble.

B. Is it clear that inclusion of kernel headers when compiling
   a program does not make that program a derived work of the
   kernel.

Implications:

A. It would be worth Linus publicising the fact (or some URL
   be added to Linux kernel postings) that patches by default
   will be assumed to be under the 'COPYING' license.

B. It would be worth someone clearing up the status of the
   license on header files.

C. It would be preferable if people read the COPYING file
   before commenting on license issues.

--
Alex Bligh



RE: LANANA: To Pending Device Number Registrants

2001-05-16 Thread Alex Bligh - linux-kernel

> I believe thats why there are persistant superblocks on the RAID
> partitions. You can switch them around, and it still knows which drive
> holds which RAID partition...  That's the only way booting off RAID
> works, and the only reason for the "RAID Autodetect" partition type...
> you can find those shuffled partitions correctly.  The only time it
> really looks at the file, is if you try to rebuild the partition I
> believe... and some other circumstance that dosn't come to mind.

OK. I obviously asked slightly the wrong question
as I was talking about a theoretical case and chose
badly.

What I meant was 'Wouldn't it be better to use some form of
static identifier / volume serial / MAC address / whatever, which
/dev/[device-type]/[ID] exactly identifies, and continues to
identify, even across config changes, than to (a) go looking
for these IDs, then (b) number them disk0..N so
that if one is removed, all the rest of them shuffle down'.
I.e., isn't the point being argued here 'don't identify
devices by major/minor device number pairs chosen on
a rather arbitrary and wasteful basis which describes how
the given device is physically connected', rather than 'the
devices shall be dynamically ordered 0..N and rely
on the order of detection, and the number of
other devices, to determine their numbering'?

As has been pointed out elsewhere, if you expose
/dev/disk/0/serial-number and/or
/dev/disk/0/interface/scsi/ whatever, in a
format trivially readable by a script, the problem
largely goes away (+/- a linear search). However,
if you don't at least try to keep (say) disk
names static across reboot with trivial
reconfig change (like inserting a PCMCIA
flash card), I can see people are going to
need to rewrite stuff like mount to look at
(say) an fstab where the first field is
not /dev/hda0, it is something which identifies
a disk beyond /dev/disk/0 and actually goes and looks
at the tree under that for some form of serial
number (or whatever), or mount -a will produce
surprising results (for instance there's no
reason a PCMCIA flash card might not be
detected before a PCMCIA HD). I understood 2.4
was meant to be pretty static in terms of
external interface.
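
As an aside, the kind of stable identifier meant here can already be read
today, at least for IDE; a purely illustrative sketch, with the device path
chosen arbitrarily:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/hdreg.h>

/* Illustrative only: identify an IDE disk by its serial number via
 * HDIO_GET_IDENTITY rather than trusting probe order. */
int main(void)
{
        struct hd_driveid id;
        int fd = open("/dev/hda", O_RDONLY | O_NONBLOCK);

        if (fd < 0 || ioctl(fd, HDIO_GET_IDENTITY, &id) != 0) {
                perror("HDIO_GET_IDENTITY");
                return 1;
        }
        /* serial_no is space padded, not NUL terminated */
        printf("serial: %.20s\n", (const char *)id.serial_no);
        return 0;
}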

--
Alex Bligh



Re: LANANA: To Pending Device Number Registrants

2001-05-15 Thread Alex Bligh - linux-kernel

> The argument that "if you use numbering based on where in the SCSI chain
> the disk is, disks don't pop in and out" is absolute crap. It's not true
> even for SCSI any more (there are devices that will aquire their location
> dynamically), and it has never been true anywhere else. Give it up.

Q: Let us assume you have dynamic numbering disk0..N as you suggest,
   and you have some s/w RAID of SCSI disks. A disk fails, and is (hot)
   removed. Life continues. You reboot the machine. Disks are now numbered
   disk0..(N-1). If the RAID config specifies using disk0..N thusly, it
   is going to be very confused, as stripes will appear in the wrong place.
   Doesn't that mean the file specifying the RAID config is going to have
   to enumerate SCSI IDs (or something configuration invariant) as
   opposed to using the disk0..N numbering anyway? Sure it can interrogate
   each disk0..N to see which has the ID that it actually wanted, but
   doesn't this rather subvert the purpose?

I.e., given one could create /dev/disk/?.+, isn't the important
argument that they share common major device numbers etc., not whether
they linearly reorder precisely to 0..N as opposed to having some form
of identifier guaranteed to be static across reboot & config change.
--
Alex Bligh



Re: [PATCH] allocation looping + kswapd CPU cycles

2001-05-08 Thread Alex Bligh - linux-kernel

>   The real fix is to measure fragmentation and the progress of kswapd, but
> that is too drastic for 2.4.x.

I suspect the real fix might, in general, be
a) to reduce use of kmalloc() etc. which gives
   physically contiguous memory, where virtually
   contiguous memory will do (and is, presumably,
   far easier to come by). (or perhaps add some
   flag to kmalloc to allocate out of virtual
   rather than physical memory).
b) to bias flush or swap out routines to create
   physically contiguous higher order blocks.
   Many heuristics will give you that ability.

Disclaimer: I haven't looked at this issue for years,
but Linux seems to fail on >4k allocations now, and
fragment memory far more, than it did on much smaller
systems doing lots of nasty (8k, thus 3 pages including
header) NFS stuff back in 94.

--
Alex Bligh



Re: ioctl arg passing

2001-04-23 Thread Alex Bligh - linux-kernel

> And there is very low chance that kmalloc() for
> anything bigger than 4KB will succeed. You should either use
> vmalloc unconditionally, or at least as fallback.

The phrase 'very low chance' is inaccurate.

How do you think NFS works with -rsize, -wsize > 4096?
kmalloc() uses get_free_pages() to allocate
more than one physically contiguous memory
page, which in turn uses a sort of modified
buddy system to distribute them. And if you
specify the right GFP_ flags, it will page out,
if necessary, to do so. (I know the hard way
as this is the one substantial thing I fixed
in the linux kernel just after 1.0 in about 95).

If you need physically (as opposed to virtually)
contiguous memory, unless lots has changed since then,
kmalloc() is the right call. However, you are
correct that it draws on scarce resources.
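
For what it's worth, the fallback pattern suggested above might look
something like this sketch (all names invented; the caller has to remember
which allocator succeeded so it can free with the matching routine):

#include <linux/errno.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

/* Sketch of 'kmalloc, falling back to vmalloc' for large ioctl
 * buffers. The flag records which allocator was used so the buffer
 * is freed with the matching routine. All names are made up. */
struct big_buf {
        void *data;
        int   used_vmalloc;
};

static int big_buf_alloc(struct big_buf *b, size_t len)
{
        b->data = kmalloc(len, GFP_KERNEL);
        b->used_vmalloc = 0;
        if (!b->data) {
                b->data = vmalloc(len);
                b->used_vmalloc = 1;
        }
        return b->data ? 0 : -ENOMEM;
}

static void big_buf_free(struct big_buf *b)
{
        if (b->used_vmalloc)
                vfree(b->data);
        else
                kfree(b->data);
}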

--
Alex Bligh



Re: Sources of entropy - /dev/random problem for network servers

2001-04-09 Thread Alex Bligh - linux-kernel

Jeff et al.,

>> However, only 3 drivers in drivers/net actually set
>> SA_SAMPLE_RANDOM when calling request_irq(). I believe
>> all of them should.
>
> No, because an attacker can potentially control input and make it
> non-random.

Point taken but:
1. In that case, if we follow your logic, no drivers (rather than 3
   arbitrary ones) should set SA_SAMPLE_RANDOM
2. Otherwise, in at least my application (and a machine
   without keyboard and mouse can't be too uncommon), there is *no*
   entropy at all, which is rather easier for a hacker. At least
   with entropy from an (albeit predictable) network, his attack
   is going to shift the state of the seed (in an albeit predictable
   manner), though he's going to have to make some accurate guesses
   each time about the forwarding delay of the router in front of it,
   and the stream of other packets hitting the server. I'd rather rely
   on this than on cron (which is effectively what is driving
   any disk entropy every few minutes and is extremely predictable).

--
Alex Bligh



Sources of entropy - /dev/random problem for network servers

2001-04-08 Thread Alex Bligh - linux-kernel

In debugging why my (unloaded) IMAP server takes many seconds
to open folders, I discovered what looks like a problem
in 2.4's feeding of entropy into /dev/random. When there
is insufficient entropy in the random number generator,
reading from /dev/random blocks for several seconds. /dev/random
is used (correctly) for cryptographic key verification.
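
A quick way to see the starvation, purely as an illustrative diagnostic, is
to ask the pool how much entropy it currently holds:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/random.h>

/* Illustrative diagnostic: report how much entropy the pool holds.
 * On the server described here this stays near zero, which is why
 * reads from /dev/random block. */
int main(void)
{
        int fd = open("/dev/random", O_RDONLY);
        int bits;

        if (fd < 0 || ioctl(fd, RNDGETENTCNT, &bits) != 0) {
                perror("RNDGETENTCNT");
                return 1;
        }
        printf("entropy available: %d bits\n", bits);
        return 0;
}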

Entropy comes from 4 sources it seems: Keyboard, Mouse, Disk I/O
and IRQs.

The machine in question is locked in a data center (can't be
the only one) and thus sees none of the former two. IDE Entropy
comes from executed IDE commands. The disk is physically largely
inactive due to caching. But there's plenty of network traffic
which should generate IRQs.

However, only 3 drivers in drivers/net actually set
SA_SAMPLE_RANDOM when calling request_irq(). I believe
all of them should. And indeed this fixed the problem for
me using an eepro100().

The following patch fixes eepro100.c - others can be
patched similarly.

--
Alex Bligh

/usr/src/linux# diff -C3 drivers/net/eepro100.c{.keep,}
*** drivers/net/eepro100.c.keep Tue Feb 13 21:15:05 2001
--- drivers/net/eepro100.c  Sun Apr  8 22:17:00 2001
***************
*** 923,929 ****
sp->in_interrupt = 0;

/* .. we can safely take handler calls during init. */
!   retval = request_irq(dev->irq, &speedo_interrupt, SA_SHIRQ, dev->name, dev);
if (retval) {
MOD_DEC_USE_COUNT;
return retval;
--- 923,929 ----
sp->in_interrupt = 0;

/* .. we can safely take handler calls during init. */
!   retval = request_irq(dev->irq, &speedo_interrupt, SA_SHIRQ | SA_SAMPLE_RANDOM, dev->name, dev);
if (retval) {
MOD_DEC_USE_COUNT;
return retval;

[ENDS]